What Is Data Orchestration?

Learn what data orchestration is and discover the pain points it addresses.

Historical problems

To support business continuity, data models need to be regularly refreshed. In the past, engineers used the cron tool in Linux systems to schedule ELT jobs. However, as data volume and system complexity increase exponentially, creating cron jobs becomes a bottleneck and eventually hits the limitation of scalability and maintainability. To be more specific, the problems are:

  • Dependencies between jobs: In a large-scale data team, it's expected to have many dependencies between data models. For example, updating the revenue table should be done only after the sales table has been updated. In more complicated scenarios, a table can have multiple upstream dependencies, each with a different schedule. Managing all these dependencies manually is time-consuming and error-prone.

  • Performance: If not managed well, cron jobs can consume a significant amount of system resources such as CPU, memory, and disk space. With the ever-increasing volume of data, performance can quickly become an issue.

  • Engineering efforts: To maintain the quality of dozens of cron jobs or apps and process a variety of data formats, data engineers have to spend a significant amount of time writing low-level code rather than creating new data pipelines.

  • Data silos: Scattered cron jobs can easily lead to data silos, resulting in duplicated efforts, conflicting data, and inconsistent data quality. Enforcing data governance policies can also be difficult, leading to potential security issues.

Get hands-on with 1200+ tech skills courses.