Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management

2016-01-31
High-performance computing platforms such as supercomputers have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as producers and not consumers of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data processing applications and has addressed many of the limitations of HPC platforms. There exists a class of scientific applications, however, that needs the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of bio-molecular dynamics, genomics and network science need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. Whereas this question needs answers at multiple levels, we focus on the design of resource management middleware that might support the needs of both. We propose extensions to the Pilot-Abstraction to provide a unifying resource management layer. This is an important step that allows applications to integrate HPC stages (e.g. simulations) with data analytics. Many supercomputing centers have started to officially support Hadoop environments, either in a dedicated environment or in hybrid deployments using tools such as myHadoop. This typically involves many intrinsic, environment-specific details that need to be mastered, and that often swamp conceptual issues such as: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data locality vs. data movement)? This paper provides both conceptual understanding and practical solutions to the integrated use of HPC and Hadoop environments.
