DESIGNING THE LAYER
This step involves conceptualizing the solution and defining the components required for implementation
- Storage: Bring data to an internal storage layer. This is a very important step, considering that the sources are purely for data transfer and there is no guarantee of persistence or availability.
- Identification: We receive multiple types of data from our partners. Every type of data has its own input definition, output definition and processing steps. There was a need to identify the type of data and triage it to the corresponding data processor.
- Normalisation: Process the files as soon as the data is sent by the partners. The data is normalised to match the schema required by downstream components. The normalised files would then be picked by scheduled batch processing jobs
DATA INGESTION COMPONENTS
- File Watcher: Identifies new files and triggers data processing jobs
- Workflow Orchestrator: Runs defined data processing workflow to produce a normalized output.
- Data Processing Engine: Separates the data wrangling from the orchestrator, and does the heavy lifting of data manipulation
- Internal Storage Layer: Stores incoming raw data as well as normalised output
SELECTING THE RIGHT TECH STACK
This step is tricky to accomplish. One should keep future plans in mind while reviewing components. We started by listing our requirements for each component.
- File watcher: We wanted this service to do the following things
- Identify new files.
- Selectively ignore certain folders within the provided destination.
- Upload the file to internal storage.
- Extract details required for file type identification.
- Trigger the data triaging service with the details extracted in the previous step
We wanted this service to do nothing more than above actions, so expected it to be lightweight.
This was also supposed to be customisable to provide a seamless data transfer experience to our partners. Hence, we ended up writing a custom script in python to handle this part.
- Workflow Orchestrator: We reviewed multiple options like Oozie, Airflow, Azkaban, Luigi, Nifi. Apache Nifi was a candidate which provided out-of-the-box features like back-pressure, data provenance, prioritized data queuing etc to build a robust data pipeline. Our choice of workflow orchestrator to not only manage our data pipeline jobs but also to handle future workflows for reporting and ML model operationalisation. We ended up selecting Apache Airflow majorly for 3 of the below reasons –
- Rich integration with Kubernetes: Our entire infrastructure is deployed as dockerized containers in Kubernetes. Airflow’s KubernetesPodOperator launches a docker image as a kubernetes pod to execute an individual Airflow task.
- Python allows seamless integration with anything: Airflow uses Python code for creating workflows, making complex workflows and ML operationalisation possible. It also makes interaction with other infra components easy with python connectors available for everything under the sun.
- Better UI and API: Airflow webserver provides an excellent way to track DAG progress along with APIs to trigger, track, pause and resume workflows. This makes it easy to integrate with other components in the product.
- Data Processing Engine: Apache Spark was a no brainer here. With time, Apache Spark has become the industry standard in open-source big data manipulation. We leverage Spark’s in memory computation for ETL jobs and reporting workloads. You’ll be amazed to know that Spark has a hidden rest api layer which allows you to submit, track and kill jobs. In our Kubernetes environment, this feature was a saviour as all the infra components communicate through API requests only Internal Storage Layer.
- The Internal Storage: Considering the scope of the product, we did not want to tie ourselves to a specific cloud provider. HDFS on EC2 machines was chosen over S3 keeping the above criteria in mind
The data-pipeline ingests 30+ different file types with ingestion time of less than 10 mins of data reception. The design helps us scale easily, the onboarding time of new data partners is less than 15 mins via a self serve tool. The tech-stack allows us to auto-scale the platform with increasing data volume and product usage running data and ML workflows with SLA greater than 99%.
About the author
Bharath Kumar is the Data Platform Engineering Lead in Goals101 and has set up Goals101’s data-platform. With experience in solving machine learning and data engineering problems for giants like Microsoft and Amazon, he has a particular interest in big data solutions and data products. Bharath spends most of his time exploring open-source tools and building internal services for our data-platform. When not coding, you can find Bharath travelling, ticking off places from his bucket list.