Which JSON data stores exist

Associative arrays

Data structure triangle for associative arrays

Associative arrays are one of the most important uses for search algorithms and search trees. Before we explain this in detail, however, we want to take another look at the data structure triangle from the first lecture, which can be illustrated very nicely using the example of associative arrays. We show it again here:

Recall that you have to choose two corners of the triangle in order to define a data structure. In the following we will show how Python defines the abstract data type "Associative Array" by specifying the permitted operations and their meaning, how the standard data format "JSON" is defined by specifying a memory layout and the meaning of the stored entities, and how efficient Implementation of the specified operations with the appropriate memory layout in each case, the data structure can be implemented in different ways.

Definition of the abstract data type

Associative arrays can be used just like normal arrays, so they support read and write access via the index operator. In contrast to the usual array, where the indices are whole numbers in the range must be, the type of the indexes can now any be. We use the term "key" for this:

a [key] = value # Save the value 'value' under the key 'key' value = a [key] # Read out the value saved under the key 'key'

A typical application is a dictionary

x = toEnglish ['Baum'] # results in 'tree'

In this case is the type of key. This is the most common case in practice, which is why associative arrays are often called Dictionary (also in Python, here the type is called). In general, however, any type can be used as a key that meets one of the following requirements:

supported comparison operations for keys possible implementation of the associative array
Identity test:
sequential search
Order relation:
or
Search tree (also binary search, if no new keys are inserted and none are deleted)
Identity test and hash function:
and
Hash table

If more is known about the key (an order relation or a hash function instead of a mere identity check), one can use correspondingly better data structures (search trees or hash tables instead of sequential search) whose access functions are much more efficient (sequential search is only possible in O (N) ).

In addition to the two access functions above, there are three additional functions in Python: one to test whether a key is present, one to delete a key and the data stored under it, and one that determines the size of the array (number of keys stored / Value pairs) returns:

if a.has_key (key): # Test whether key 'key' exists del a [key] # Remove key 'key' and associated data from the array print len ​​(a) # Output size of the array

The syntax of the functions listed applies to the use of an associative array. If you want to implement such a data type, you have to provide the corresponding functionality as methods of the respective class. The Python interpreter automatically transforms the index / key access as well as the and operators into calls to the respective method, as the following table shows. The complete definition of the meaning of the individual operations (as required by the data structure triangle) also includes the specification of the behavior in the event of an error (e.g. if a requested key does not exist).

surgery method call generated internally by Python Method signature to be implemented importance
* if already exists: replace the associated data with
* if it does not yet exist: create a new key and save it as the associated data
* if exists: return the associated data
* if does not exist: throw an exception
returns if exists, otherwise
* if exists: remove this key and the associated data
* if does not exist: throw an exception
returns the size of the array

Due to the definition, it is clear that each key can only appear once in the array. The definition of the abstract data type "associative array" allows us to implement such arrays in a wide variety of ways without changing anything in terms of usage (i.e. how the array functionality is called later).

The JSON data format

The second edge of the data structure triangle is the "data format". This is where you define the memory layout and meaning. A data format is primarily used to save data on the hard drive and to exchange data between different programs or websites. In the case of associative arrays, the JSON format is becoming more and more popular because it is simple and yet powerful. It is very suitable for storing associative arrays (i.e. key / value pairs) and also supports normal arrays and hierarchical structures, because the values ​​can in turn be (normal or associative) arrays.

The memory layout of a JSON file is defined as a byte sequence that is interpreted as a character sequence according to the UTF-8 standard. This has two advantages: on the one hand, the format is compatible with all common systems and is the same everywhere; on the other hand, every JSON file can be easily opened and edited in a text editor and is equally easy to read for people and machines.

A meaning is assigned to a given memory content in JSON with the help of a grammar. A JSON file contains either a normal array or an associative array (dictionary):

JSON_file: = array | dictionary

An ordinary array is written as a sequence of one or more elements, separated by commas and enclosed in square brackets (characters enclosed in single quotes in the grammar must be explicitly in the JSON file). Empty arrays are also allowed:

array: = '[' elements ']' | '[' ']' elements: = value | value ',' elements

Similarly, a dictionary is written as a series of key / value pairs separated by commas and enclosed in curly braces. Empty dictionaries are allowed. The keys must always be strings followed by a colon:

dictionary: = '{' pairs '}' | '{' '}' pairs: = string ':' value | string ':' value ',' pairs

Strings are character strings (including some special characters such as "\ n" for a line break) enclosed in double quotation marks, or the empty string:

string: = '"' '"' | '"' characters '"'

Values ​​can be numbers (whole or real numbers, defined as in Python), Boolean values ​​('true' or 'false'), strings, or 'null'. In addition, arrays and dictionaries can be used as values, which results in hierarchical data structures nested at will:

value: = number | string | boolean | zero | array | dictionary

Here is a simple example of a JSON file that contains excerpts from a student database:

{"Müller, Friedrich": {"Mathematics": 2.0, "ALDA": 1.3}, "Weise, Anna": {"Mathematics": 1.0, "Philosophy": 1.3}}

The JSON format is syntactically very close to the Python language, and with a few definitions it can be converted directly into a Python dictionary or array using the function.

# don't do this - it is highly unsafe and dangerous true, false, null = True, False, None # Define missing constants res = eval (file ("test.json"). read (). decode ("utf_8")) # Read in the file and execute it as Python code

One should, however no way because a hacker could execute arbitrary code that he had previously smuggled into the file 'test.json'. Since the function only checks whether the expression is valid Python, but not whether the file contains valid JSON (i.e. only data, but no executable code), this cannot be recognized or prevented. Therefore you should always use the Python module json to read in JSON, which would simply reject a manipulated file:

# Safe reading and conversion import json with open ('test.json', encoding = 'utf-8') as f: # Open file in UTF-8 format res = json.load (f) # and read in as json

Implementation of associative array classes

The third edge of the data structure triangle finally relates to the implementation of the data structure as a class by implementing the required operations on suitably organized memory. In Python, the class, a very powerful implementation of an associative array, is an integral part of the language. This implementation is based on the concept of hash tables, which we will deal with later in the lecture. You need a function that is already implemented in Python for all standard data types. In this section we want to consider two alternative implementations based on sequential search and based on search trees.

Realization through sequential search

If only an identity comparison is made for the keys

key1 == key2

is defined, there is no way to speed up the key search using a special data structure. You then simply store the key / value pairs in an ordinary array that you search sequentially. To do this, we first implement an auxiliary class that accepts key / value pairs:

class KeyValuePair: def __init __ (self, key, value): self.key = key self.value = value

The array class stores the pairs in an array, the current length of which is the size of the associative array. This defines the memory layout of the class:

class SequentialSearchArray: def __init __ (self): self.data = [] def __len __ (self): return len (self.data)

In order to access the data, we have to look for the right key. To do this, we implement the helper function that returns the index of the key or, if the key does not exist:

def findKey (self, key): for k in xrange (len (self.data)): if key == self.data [k] .key: return k return None

When inserting an element, we first have to check whether the key already exists and then either overwrite the data or add a new element:

def __setitem __ (self, key, value): k = self.findKey (key) if k is None: self.data.append (KeyValuePair (key, value)) # insert new pair else: self.data [k] .value = value # replace data

The search, on the other hand, throws an exception if the key was not found:

def __getitem __ (self, key): k = self.findKey (key) if k is None: raise KeyError (key) # key not found => error else: return self.data [k] .value # key found => data hand back

The other required functions are just as easy to implement:

def has_key (self, key): return (self.findKey (key) is not None) def __delitem __ (self, key): k = self.findKey (key) if k is None: raise KeyError (key) # Key not found => Error else: del self.data [k] # key found => delete

Because of the sequential search, access to an element in this data structure has the complexity O (len (a)).

Realization as a search tree

If an order is defined for the key type of the associative array (i.e. if or are supported), the indexing problem can be reduced to the search problem. Then the associative array can be implemented efficiently as a self-balancing search tree, so that the access functions only have a complexity in O (log (len (a))). The data structure of the search tree must be expanded so that the associated data is also saved for each key. The node class is therefore extended by a "value" field:

class Node: def __init __ (self, key, value): self.key = key self.data = value self.left = self.right = None

Then you can implement a class whose constructor initializes an empty search tree:

class TreeSearchArray: def __init __ (self): self.root = None self.size = 0 def __len __ (self): return self.size

The function checks whether an entry with the relevant key already exists. If so, its data will be overwritten with the new data, otherwise a new entry will be created. The already known functions and are used internally for this purpose (see section Search trees):

def __setitem __ (self, key, value): node = treeSearch (self.root, key) if node is None: self.root = treeInsert (self.root, key) self.size + = 1 node = treeSearch (self.root , key) node.value = value

(A more skilful implementation would of course eliminate the second call to and set the data straight away in. However, this does not change the complexity of the function.) The function also searches for an entry with the given key. If it is found, it returns the associated data, otherwise an error message:

def __getitem __ (self, key): node = treeSearch (self.root, key) if node is None: raise KeyError (key) else: return node.value

When implemented with a tree search, the index operations have a complexity of O (log n).

An important example of an associative array implemented in this way is the C ++ standard class.

next topic