When Hadoop is deployed with on-premises architecture, compute and storage are combined together. As a result, compute and storage must be scaled together and the clusters must be persistently on otherwise the data becomes inaccessible. On the cloud, compute and storage can be separated with a service such as AWS EC2 and AWS S3 used as the object store. This means they can be scaled separately depending on the data team’s needs. Why does this distinction matter? Since compute and storage are tied together in an on-premises solution, elasticity is much harder to achieve and manage. On the other hand, a key advantage of cloud infrastructure is it gives the data team fine grained control over speed vs cost. How This Applies to Big Data While a key advantage of big data technology is the ability to collect and store large volumes of structured, unstructured and raw data in a data lake, most organizations only end up processing a small percentage of the data they gather. According to recent research from Forrester, an estimated 60-73% of data that businesses store ends up not being processed. Given this statistic, deployments that tie compute and storage together end up spending on compute […]