Harnessing Serverless Computing for Efficient and Scalable Big Data Analytics Workloads

Vishal Shahane

Authors

Vishal Shahane, Software Engineer, Amazon Web Services, Seattle, WA, United States. ORCID: https://orcid.org/0009-0004-4993-5488

Keywords:

serverless computing, big data analytics, scalable workloads, elastic scaling, cloud computing, cost efficiency, event-driven architecture, AWS Lambda, Google Cloud Functions, Azure Functions

Abstract

In the era of big data, the ability to process vast amounts of information efficiently and at scale is crucial for organizations across industries. Traditional big data analytics frameworks often require substantial infrastructure investment and ongoing management effort, both of which are resource-intensive and costly. Serverless computing has emerged as a paradigm that addresses these challenges by abstracting away infrastructure management, so that developers can focus on their applications. This research paper explores the potential of serverless computing for efficient and scalable big data analytics workloads.

Serverless computing, characterized by its event-driven architecture and automatic scaling capabilities, offers a compelling alternative to conventional server-based approaches. In a serverless model, cloud providers manage the provisioning, scaling, and maintenance of servers, allowing developers to deploy code in the form of discrete functions that are executed in response to events. This model inherently supports scalability, as the cloud provider dynamically allocates resources based on the workload's demands, ensuring efficient utilization without the need for manual intervention.
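As a minimal sketch of this model, the function below is written in the `handler(event, context)` style used by Python functions on AWS Lambda. The event shape (a `records` list whose items carry a `type` field) is a hypothetical example chosen for illustration, not a provider-defined format, and the aggregation itself is only a placeholder for real analytics logic.

```python
import json
from collections import Counter


def handler(event, context=None):
    """Stateless function: everything it needs arrives in the event."""
    # ASSUMPTION: the "records" key and the "type" field are a hypothetical
    # event shape used for illustration only.
    records = event.get("records", [])
    counts = Counter(r["type"] for r in records if "type" in r)
    # Nothing is kept in memory between invocations; the result is returned
    # as a JSON-serializable response.
    return {"statusCode": 200, "body": json.dumps(dict(counts), sort_keys=True)}


if __name__ == "__main__":
    sample = {"records": [{"type": "click"}, {"type": "view"}, {"type": "click"}]}
    print(handler(sample))
```

Because the function holds no state of its own, the platform is free to run as many copies in parallel as incoming events require.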

The paper begins by examining the core principles of serverless computing and its distinguishing features, such as statelessness, fine-grained resource allocation, and event-driven execution. We then delve into the specific requirements of big data analytics workloads, which include handling large volumes of data, processing complex queries, and delivering low-latency results. By mapping these requirements to the capabilities of serverless computing, we identify several advantages that make serverless an attractive option for big data analytics.

One of the primary benefits of serverless computing for big data analytics is elastic scaling. Big data workloads often experience fluctuating demand, with periods of intense activity followed by idle time. Serverless platforms automatically scale up during peak usage and scale down when demand decreases, which optimizes resource consumption and reduces cost. In addition, under the pay-as-you-go pricing model organizations pay only for the compute resources they actually use, which further improves cost efficiency.
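The cost argument can be made concrete with a simple model. The sketch below assumes the billing structure commonly used by function platforms (a charge per GB-second of execution plus a charge per request) and compares it with an always-on fleet billed per instance-hour. All rates and workload figures in the example are illustrative placeholders, not any provider's published prices and not measurements from this study.

```python
def serverless_cost(invocations, avg_duration_s, memory_gb,
                    price_per_gb_second, price_per_request):
    """Pay-per-use cost: billed compute (GB-seconds) plus a per-request fee."""
    gb_seconds = invocations * avg_duration_s * memory_gb
    return gb_seconds * price_per_gb_second + invocations * price_per_request


def provisioned_cost(instances, hours, price_per_instance_hour):
    """Always-on cost: paid for every hour, whether busy or idle."""
    return instances * hours * price_per_instance_hour


if __name__ == "__main__":
    # ASSUMPTION: hypothetical monthly workload and placeholder rates.
    pay_per_use = serverless_cost(
        invocations=1_000_000, avg_duration_s=0.5, memory_gb=1.0,
        price_per_gb_second=0.00002, price_per_request=0.0000002,
    )
    always_on = provisioned_cost(instances=2, hours=720,
                                 price_per_instance_hour=0.10)
    print(f"serverless: {pay_per_use:.2f}  provisioned: {always_on:.2f}")
```

The comparison favors serverless when the workload is bursty, since the provisioned term does not shrink during idle periods; for sustained high utilization the balance can tip the other way.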

To validate the feasibility and performance of serverless computing for big data analytics, we conducted a series of experiments using popular serverless platforms such as AWS Lambda, Google Cloud Functions, and Azure Functions. These experiments involved processing various big data workloads, including real-time data streaming, batch processing, and machine learning model inference. Our results demonstrate that serverless computing can achieve comparable, if not superior, performance to traditional server-based approaches while significantly reducing operational complexity and cost.

Moreover, the paper explores the challenges associated with serverless computing in the context of big data analytics. These challenges include cold start latency, limited execution time, and the complexity of managing stateful operations. We discuss potential solutions and best practices to mitigate these issues, such as using warming strategies to reduce cold start latency, leveraging external storage services for stateful operations, and decomposing large tasks into smaller, more manageable functions.
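Two of these mitigations, decomposing a large task into smaller functions and keeping state in external storage, can be sketched together as a fan-out/fan-in job. This is an illustrative sketch under stated assumptions rather than the implementation used in our experiments: the in-memory `STORE` dictionary stands in for an external storage service, threads stand in for parallel function invocations, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# ASSUMPTION: stand-in for an external store (object storage or a
# key-value service); real functions would not share process memory.
STORE = {}


def split(data, chunk_size):
    """Decompose one large task into chunks small enough for a time limit."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_function(job_id, index, chunk):
    """One short-lived function: process a chunk, persist the partial result."""
    STORE[(job_id, index)] = sum(chunk)


def reduce_function(job_id, num_chunks):
    """Final function: read partial results from the store and combine them."""
    return sum(STORE[(job_id, i)] for i in range(num_chunks))


def run_job(job_id, data, chunk_size=1000):
    """Fan out one invocation per chunk, then fan in with a single reducer."""
    chunks = split(data, chunk_size)
    # Threads stand in for parallel function invocations.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(map_function, job_id, i, c)
                   for i, c in enumerate(chunks)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return reduce_function(job_id, len(chunks))


if __name__ == "__main__":
    print(run_job("demo", list(range(10_000))))
```

Each map step is stateless and bounded in size, so it fits within an execution time limit, while the only shared state lives in the store that the reducer reads from.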

The research concludes by highlighting future directions for integrating serverless computing with big data analytics. We envision advancements in serverless orchestration frameworks that seamlessly coordinate complex workflows, improvements in serverless data processing engines that optimize query execution, and enhanced support for hybrid serverless architectures that combine serverless and traditional server-based components. Additionally, emerging technologies such as edge computing and federated learning present new opportunities for extending the capabilities of serverless big data analytics.

This research demonstrates that serverless computing holds significant promise for transforming big data analytics by offering a scalable, efficient, and cost-effective solution. By harnessing the inherent strengths of serverless computing, organizations can better manage their big data workloads, achieve faster insights, and drive innovation without the burdens of traditional infrastructure management.
