Python 3.9 Updates: topographical sort and string manipulation

Nov 13, 2020 - 12 min read
Vincent Russo
editor-page-cover

Vincent Russo is an Educative course contributor who is a full-time software developer and runs LucidProgramming, a YouTube channel to help individuals improve their software skills and value as developers.

LucidProgramming has content centered on Python development where the topics covered include data structures, algorithms, web scraping, natural language processing, and many more.




On October 5, 2020, Python 3.9 was released, and with it, came a number of new features. One handy new feature is the addition of the graphlib module that now comes standard with Python 3.9.

At the time of this writing, graphlib only contains functionality for performing topological sorting, but it is planned to house a number of other graph algorithms to include in the Python standard library.

Some contributions have also been made pertaining to string manipulation. Python already has extensive native functionality for manipulating strings. In Python 3.9, there are further contributions and improvements that this new version of Python offers for all your string tweaking needs.

We’ll be taking a deep dive into what these new contributions are and how to invoke them in your own Python projects.

In this post, we will cover the TopologicalSorter component of graphlib and how to use the updated string manipulation in Python 3.9.

Today, we will go over:


Succeed in your Python interview

Review of the most common data structures and algorithms that you’ll see in interviews with hands-on, real-world questions.

Data Structures and Algorithms in Python


Topological ordering and Topological sort

Topological sort is an algorithm that takes a directed acyclic graph (DAG) as input and returns an ordering where each vertex appears prior to all vertices that it points to. We call this ordering a topological ordering.

More formally, quoting from Wikipedia:

“…topological sort returns a linear ordering of its vertices such that for every directed edge uvuv from vertex uu to vertex vv, uu becomes before vv in the ordering.”

To make the definition more concrete, let’s take a look at an example. Consider the following DAG:

             +---------------------+
             |                     |
             |                     v
+---+      +---+      +---+      +---+
| A | ---> | B | ---> | D | ---> | E |
+---+      +---+      +---+      +---+
  |                     ^ 
  |        +---+        | 
  +------> | C | -------+
           +---+
           
Figure-1: Example of a directed acyclic graph (DAG).

One can see that the graph in Figure-1 is a DAG as it contains no cycles.

The vertex labelled A in Figure-1 is connected to vertices labelled with B and C, so a topological ordering would be one where A comes before both B and C. Applying this same logic to the DAG for the remaining vertices, we find that one valid topological ordering is [A, B, C, D, E].

Topological orderings are not necessarily unique however, so it is possible to have more than one valid topological ordering. For instance the ordering [A, C, B, D, E] is also valid.

Topological sort is a useful tool to have in ones arsenal. One application of topological sort is for job scheduling. If you encode the vertices as jobs that need to be completed and the edges as dependencies for the jobs, performing a topological sort will indicate an ordering of jobs that can be completed without encountering any dependency conflicts.


Topological sorting in Python 3.9

The way that we encode the data in a graph in Python is by using the dictionary where the keys correspond to the labels of the vertices and where the values are a dictionary of labels that correspond to what vertices the key vertex is connected to.

To make the above more concrete, let’s take the DAG from Figure-1 and encode it in a Python dictionary structure.

# Encode the DAG from Figure-1 in Python dictionary.
graph = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"D", "B"}}

Taking this graph object that we constructed, we can perform a topological sort to see what the result yields

>>> import graphlib
>>>
>>> ts = graphlib.TopologicalSorter(graph)
>>> print(list(ts.static_order()))
('A', 'B', 'C', 'D', 'E')

We see that indeed, the result of performing topological sort on this graph agrees with our assessment in the previous section.

The usage of the static_order() method returns an iterable of nodes in a topological order. Converting that result to a list allows us to print and see the topological ordering of the graph object.

Note that if we define a graph that is not a DAG, say, a graph that contains a cycle, graphlib will throw a CycleError to indicate that the graph is cyclic and therefore cannot possess a topological ordering.

Consider the following cyclic graph.

+---+      +---+      +---+ 
| A | <--- | B | <--- | D | 
+---+      +---+      +---+ 
  |                     ^ 
  |        +---+        | 
  +------> | C | -------+
           +---+
           
Figure-2: Example of a cyclic graph.

We can confirm that encoding this graph in a Python dictionary and attempting make use of the TopologicalSorter will yield a CycleError.

>>> import graphlib
>>> 
>>> # Encode the graph from Figure-2 in Python dictionary.
>>> graph = {"A": {"B"}, "B": {"D"}, "C": {"A"}, "D": {"C"}}
>>> 
>>> ts = graphlib.TopologicalSorter(graph) 
>>> list(ts.static_order())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/graphlib.py", line 241, in static_order
    self.prepare()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/graphlib.py", line 103, in prepare
    raise CycleError(f"nodes are in a cycle", cycle)
graphlib.CycleError: ('nodes are in a cycle', ['A', 'B', 'D', 'C', 'A'])

As expected, we cannot obtain a topological ordering on a cyclic graph and seeing this CycleError indicates that.


Example: Course Scheduling

To see an application of where topological sort is useful and where we can make use of the TopologicalSorter from graphlib, consider the following problem “Course Scheduling” problem from LeetCode.

There are a total of num_courses courses you have to take, labelled from 0 to num_courses-1. Some courses may have prerequisites, for example to take course 0 you have to first take course 1, which is expressed as a pair: pre_reqs=[0,1].

Given the total number of courses and a list of prerequisite pairs, is it possible for you to finish all courses?

As an example, consider the input num_courses = 2 and pre_reqs = [[1, 0]]. There does exist a way in which to finish all of the courses. In this case, there are a total of two courses that need to be taken.

In order to take course 1, it is required that you must have finished course 0, so this is possible to achieve.

Alternatively, consider the input pre_reqs = [[1,0],[0,1]]. Here, there does not exist a way in which to finish all of the courses. One needs to take two courses. To take course 1, it is required that course 0 is finished.

Paradoxically, to take course 0 it is required that course 1 is finished. We have a catch-22 situation on our hands here, and thus, it is not possible to satisfy this criterion.

The course scheduling problem lends itself nicely to a setting whose relationships can be encoded in a graph. The vertices of the graph represent the courses and the edges between the vertices represent the prerequisites that must be completed prior to taking.

Given such a graph encoded in this manner, if we are able to obtain a topological ordering then it must be possible for a student to complete the required number of courses.

Alternatively, if we have a cycle in our graph, then we know that such a topological ordering is not possible and therefore the answer in that case would be “no”.

import graphlib

def can_schedule(graph):
    ts = graphlib.TopologicalSorter(graph)
    
    try:
        # The graph has a topological ordering.
        list(ts.static_order())
        return True
    except:
        # The graph does *not* have a topological ordering.
        return False
# Example: A valid topological ordering exists, and therefore
# it is possible to schedule the courses.
>>> graph = {0: {1}}
>>> can_schedule(graph)
True
# Example: A valid topological ordering does *not* exist, and
# therefore it is *not* possible to schedule the courses.
>>> graph = {0: {1}, {1}: 0}
>>> can_schedule(graph)
False


Continue learning about Python

This course is a detailed review of the most common data structures and algorithms that you’ll see in interviews. With implementation details, thorough explanations, and hands-on coding exercises, you’ll quickly gain the confidence you need to solve any problem, no matter the situation.

Data Structures and Algorithms in Python


String Manipulation: Prefix and suffix removal

PEP 616 provides two new methods,removeprefix() and removesuffix(), as built-in methods for Python string objects. As the function name sugests, removeprefix() removes a specified prefix string from a string object and removesuffix() removes a specified suffix string.

Let’s take a look at some examples:

>>> s = "I think Educative is awesome!"

>>> s.removeprefix("I think ")
'Educative is awesome!'

As we can see, the argument to the removeprefix() method is the prefix from the string s that we wish to remove.

If the prefix in question does not exist in the string, then the original string is returned as a result instead of any ValueError being thrown. For instance:

>>> s = "I think Educative is awesome!"

>>> s.removeprefix("I know ")
'I think Educative is awesome!'

In a situation where you want to know that there was indeed a prefix and it was removed, one could check that by comparing the length of the strings prior to and after the call to removeprefix(). For example:

>>> if len(s) == len(s.removeprefix(prefix)):
>>>     print("No prefix was removed.")

>>> else:
>>>     print(f"Prefix {prefix} was removed.")

Of course, in many cases it would be preferable for the function to “fail gracefully” and to only remove a prefix if one exists and to pass through if it does not.

The same basic idea applies to the removesuffix() method operating on strings, where of course the key difference for this method is that it removes the suffix of a string as opposed to the prefix. A small example for completeness:

>>> s = "Educative is awesome, I think!"

>>> s.removesuffix(", I think!")
'Educative is awesome'

One subtle point that should be mentioned is that if we did not provide the complete suffix, that is, part of the string that does not extend all the way to the end of the string, we would obtain the same initial string back as our result. Observe that:

>>> s = "Educative is awesome, I think!"

>>> s.removesuffix(", I think")
'Educative is awesome, I think!'

That is, since the string in the argument to removesuffix is not the complete suffix of the string s, it is ignored and instead, the initial string is returned back to the user.

There are some direct benefits to preferring removeprefix() and removesuffix() over doing something like making use of the replace() function that Python provides.

>>> s = "Educative is awesome, I think!"
>>> s.replace(", I think", "")
'Educative is awesome'

While the behavior in this case gives us what we want, behind the scenes the replace() method is more computationally expensive. We certainly wouldn’t notice any difference for examples of the size we have here.

But if, say, you are performing various string manipulations on large sequences of genes for a bioinformatics projects, the expensive nature of using replace() over either removeprefix() or removesuffix() could be much more noticeable, depending on the size of your data.

Another arguable benefit to the removeprefix() and removesuffix() functions are that they are more descriptive. Instead of using something opaque and fragile for the specific instance, it is more direct as to what the end result of running the computation on the string should be.

If you are revamping an existing Python project to be compatible with Python 3.9, and it happens to make use of various string manipulations, it may be worth considering where you could make use of the now-standard removeprefix() and removesuffix() methods.

Further information about the removeprefix() and removesuffix methods can be found in the respective PEP 616 – String methods to remove prefixes and suffixes.


Replace and empty strings

Bug 28029 – Replace and empty strings, which pertained to some confusing behavior of the aforementioned replace() function for strings, was adapted in Python 3.9.

To understand the misleading behavior, consider the replace() method prototype according to the Python docs.

string.replace(s, old, new[, maxreplace])

Return a copy of string s with all occurrences of substring old replaced by new. If the optional argument maxreplace is given, the first maxreplace occurrences are replaced.

We have that s is the input string, the old argument pertains to the string to search for in s, the new argument is the string to replace the old value with, and finally the maxreplace is a number specifying how many occurrences of the old value you want to replace.

If no maxreplace is specified, the replace() method defaults to applying itself to all occurrences of old to new.

So for instance, a common usage of replace() is to replace all of the spaces in a string with no spaces. Take for example:

>>> "123 456 789".replace(" ", "")

'123456789'

We see that all of the occurrences of the " " character (the space) was replaced with the "" character (no space). Taking advantage of the maxreplace parameter, we can specify how many of the instances of the character in question we wish to replace (starting from the beginning of the string).

For instance, if we just want to replace just the first instance of a space in the example above, we could do the following:

>>> "123 456 789".replace(" ", "", 1)

'123456 789'

One surprisingly and perhaps counterintuitive behavior of the replace() function was discovered in 2016, which is documented in previously mentioned Bug 28029.

The gist of the confusing behavior is captured the following example:

>>> # First example:
>>> "".replace("", "prefix")
"prefix"

>>> # Second example:
>>> "".replace("", "prefix", 1)
""

The first line makes sense, as we want to replace the empty string with the string "prefix". However, we should also expect the same output for the second example.

The change for Python 3.9 was to fix this perplexing edge case and to have the replace() function return the string provided as input instead of an empty string for all non-zero maxreplace.

Fixing this issue resolves an edge case that could come off as potentially confusing to new Python users in a function that has been with Python for quite a long time.


Summary and wrap up

While there have been third party implementations of topological sort in the Python ecosystem (e.g. topsort and NetworkX), native support has been included as of the release of Python 3.9.

The graphlib module is presently sparse, but will continue to be developed to include other graph algorithms in future iterations. Beyond the scope of this article, the TopologicalSorter also has support for parallelization and can be applied to settings where invoking such an approach would be advantageous.

On top of that, string manipulation is an important feature in most modern programming languages. While the release of Python 3.9 includes a number of newer features and functions, it is good to see a dedicated effort on the core aspects of Python’s feature set.

Hopefully you can make use of these new changes to improve your own Python projects. Check out my course Data Structures and Algorithms in Python to continue learning about Python in the real-world.

Happy learning!


Continue reading about Python


WRITTEN BYVincent Russo

Join a community of 270,000 monthly readers. A free, bi-monthly email with a roundup of Educative's top articles and coding tips.