Orchestration: Oozie

Oozie enables orchestration and scheduling of Hadoop jobs across the ecosystem.

•Apache Oozie is a scheduler and workflow engine for Hadoop jobs that fits well in large production environments

•It is a server based workflow engine

•Oozie can run workflow jobs with MapReduce and Pig action nodes

•Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie Architecture

Oozie Workflow Nodes

Control Flow

•Start/end/kill

•Decision

•Fork/join

Actions

•Map-reduce

•Pig

•HDFS

•Sub-workflow

•Java: run custom Java code

•To run Oozie workflows, two files are needed:

  1. workflow.xml (stored in HDFS)

•It contains the structure of the workflow.

  2. job.properties (stored on the local file system)

•It contains the configuration properties.
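To show how the control flow and action nodes listed above fit together, here is a minimal sketch of a workflow.xml. It assumes the 0.4 workflow schema and that nameNode and jobTracker are defined in job.properties. The node names, the output path, the script example.pig, and the class com.example.Main are hypothetical placeholders, not taken from a real project.

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- Control flow: every workflow starts at exactly one node -->
    <start to="prepare"/>

    <!-- HDFS (fs) action: remove a hypothetical output directory before the run -->
    <action name="prepare">
        <fs>
            <delete path="${nameNode}/user/${wf:user()}/example/output"/>
        </fs>
        <ok to="split"/>
        <error to="fail"/>
    </action>

    <!-- Fork: run the two actions below in parallel -->
    <fork name="split">
        <path start="pig-step"/>
        <path start="java-step"/>
    </fork>

    <!-- Pig action: example.pig is a hypothetical script name -->
    <action name="pig-step">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>example.pig</script>
        </pig>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- Java action: com.example.Main is a hypothetical class -->
    <action name="java-step">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <main-class>com.example.Main</main-class>
        </java>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- Join: wait for both forked paths before continuing -->
    <join name="merge" to="end"/>

    <!-- Kill: terminate the workflow with an error message -->
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <end name="end"/>
</workflow-app>
```

Each action declares both an ok and an error transition, so a failure on either forked path routes to the kill node instead of leaving the workflow hanging at the join.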

Oozie Server

•The Oozie server is designed to work with either MRv1 or YARN. Please note that it cannot work with both simultaneously.

•It can be configured with the CATALINA_BASE variable in /etc/oozie/conf/oozie-env.sh

Hadoop 1

•CATALINA_BASE=/usr/lib/oozie/oozie-server-0.20

Hadoop 2

•CATALINA_BASE=/usr/lib/oozie/oozie-server

Oozie Sample Workflow

nameNode=Address of NameNode

jobTracker=Address of JobTracker

oozie.libpath=Path containing related jars

oozie.wf.application.path=Path containing workflow.xml
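The values defined in job.properties are typically referenced from workflow.xml as ${...} variables. The fragment below is a sketch of a MapReduce action that picks up jobTracker and nameNode this way. The action name, the mapper and reducer classes (com.example.*), the input and output paths, and the transition targets "end" and "fail" are hypothetical, and the fragment is meant to sit inside a workflow-app element that defines those two target nodes.

```xml
<action name="mr-step">
    <map-reduce>
        <!-- Both values are resolved from job.properties at submission time -->
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Hypothetical mapper and reducer classes -->
            <property>
                <name>mapred.mapper.class</name>
                <value>com.example.ExampleMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>com.example.ExampleReducer</value>
            </property>
            <!-- Hypothetical input and output locations in HDFS -->
            <property>
                <name>mapred.input.dir</name>
                <value>/user/${wf:user()}/example/input</value>
            </property>
            <property>
                <name>mapred.output.dir</name>
                <value>/user/${wf:user()}/example/output</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Keeping cluster addresses in job.properties rather than in workflow.xml means the same workflow definition can be submitted to a different cluster by changing only the properties file.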

Oozie Coordinator

•Oozie Coordinator is a collection of predicates (conditional statements based on time-frequency and data availability) and actions (i.e. Hadoop Map/Reduce jobs, Hadoop file system, Hadoop Streaming, Pig, Java and Oozie sub-workflow).

•Actions are recurrent workflow jobs, invoked each time the predicate evaluates to true.

<coordinator-app name="Name of workflow" frequency="frequency in minutes" start="Start Time" end="End Time" timezone="Time Zone" xmlns="uri:oozie:coordinator:0.1">
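As an illustration of the attributes in the template above, the following sketch shows a complete time-based coordinator that launches one workflow per day. The application name, the dates, and the workflowAppPath property are hypothetical; workflowAppPath is assumed to be defined in the coordinator's properties file and to point at the HDFS directory holding workflow.xml.

```xml
<!-- frequency is given in minutes: 1440 minutes = one run per day -->
<coordinator-app name="daily-example" frequency="1440"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <action>
        <workflow>
            <!-- HDFS path of the workflow application to run (hypothetical property) -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
```

This example uses only a time predicate. A data availability predicate would additionally declare datasets and input-events elements before the action.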