Apache NiFi is a tool for automating the flow of data between systems. Its simple drag-and-drop interface lets us build pipelines that move data from source systems to analytics systems with relative ease.
NiFi provides secure and reliable data delivery: FlowFiles are persisted to disk as they move through a flow, which guards against data loss. That is a critically important feature of any ETL tool.
NiFi should be used for relatively simple ETL processes: for example, converting between data formats (JSON to CSV), making routing decisions (sending a FlowFile down a different pipeline depending on its contents) and performing simple enrichment of data.
NiFi should not be used for heavy computation, complex event processing, joins, rolling windows or aggregate operations. For these, use Spark.
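To make "simple ETL" concrete, here is a minimal sketch in plain Python of the kind of work that fits: a format conversion and a content-based routing decision. This is not NiFi code. In NiFi these steps would be configured as processors on the canvas. The function names, the `level` field and the route names are hypothetical, and the conversion assumes every record has the same flat set of keys.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text with a header row.

    Assumes every object has the same keys as the first one.
    """
    records = json.loads(json_text)
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def choose_route(record: dict) -> str:
    """Pick a downstream pipeline name from the record contents.

    The "level" field and both route names are made up for illustration.
    """
    return "errors" if record.get("level") == "ERROR" else "default"
```

Both functions look at one record, or one small batch, at a time and keep no state between calls. Joins, windows and aggregates need state across many records, which is why they are better left to Spark.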
Apache NiFi concepts:
- FlowFile: this is the data, the payload. It consists of the content itself and attributes (creation date, etc.). FlowFiles are persisted to disk right after creation.
- Processor: this is what applies transformations or rules to FlowFiles to generate new FlowFiles. Processors can run in parallel.
- Connection: a connection is the queue of FlowFiles waiting to be processed by the next processor. You can define rules for which FlowFiles should be prioritised (e.g. FIFO). You can also define back pressure (i.e. the maximum number of FlowFiles, or the maximum total data size, allowed in the queue).
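The three concepts above can be sketched as a toy model in Python. This is only an illustration of how they relate, not NiFi's actual API: the class and method names are hypothetical, the queue is strictly FIFO, and back pressure is modelled as a simple cap on the number of queued FlowFiles.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class FlowFile:
    """The payload plus its descriptive attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


class Connection:
    """A FIFO queue between two processors with a FlowFile-count threshold."""

    def __init__(self, back_pressure_threshold: int) -> None:
        self.back_pressure_threshold = back_pressure_threshold
        self._queue: Deque[FlowFile] = deque()

    def __len__(self) -> int:
        return len(self._queue)

    def is_full(self) -> bool:
        return len(self._queue) >= self.back_pressure_threshold

    def offer(self, flowfile: FlowFile) -> bool:
        """Enqueue unless back pressure applies; report whether it was accepted."""
        if self.is_full():
            return False
        self._queue.append(flowfile)
        return True

    def poll(self) -> Optional[FlowFile]:
        """Remove and return the oldest FlowFile, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


class Processor:
    """Takes FlowFiles from one connection, transforms them, passes them on."""

    def __init__(
        self,
        transform: Callable[[FlowFile], FlowFile],
        source: Connection,
        destination: Connection,
    ) -> None:
        self.transform = transform
        self.source = source
        self.destination = destination

    def run_once(self) -> bool:
        """Process one FlowFile; do nothing if downstream is full or input is empty."""
        if self.destination.is_full():
            return False  # back pressure: leave the FlowFile queued upstream
        flowfile = self.source.poll()
        if flowfile is None:
            return False
        self.destination.offer(self.transform(flowfile))
        return True
```

The point of the sketch is the back pressure check in `run_once`: when the downstream queue is at its threshold, the processor stops pulling work and FlowFiles wait in the upstream queue instead of being dropped.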
Installing Apache NiFi is simple:
- Download the zip file from the Apache NiFi website
- Extract the zip file
- cd into the bin directory of the extracted folder
- chmod the nifi.sh file to make it executable
- Run ./nifi.sh run
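- Check the install at: http://localhost:8080/nifi/ (NiFi 1.14 and later default to HTTPS, so on those releases try https://localhost:8443/nifi instead)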
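NiFi can take a little while to start, so the last step may fail if we try it too early. As a small convenience, the check can be scripted with Python's standard library. This is a sketch, not part of NiFi: the function name is hypothetical, and the default URL is the one from the step above, so it assumes an install that serves the UI over plain HTTP on port 8080.

```python
import urllib.error
import urllib.request


def nifi_ui_reachable(url: str = "http://localhost:8080/nifi/", timeout: float = 2.0) -> bool:
    """Return True if the given URL answers with a successful HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready.
        return False
```

Calling `nifi_ui_reachable()` every few seconds until it returns `True` tells us when the UI is ready to open in a browser.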
