Data Locality with HDFS and Apache Spark

Author: Wolfgang Buchner (Machine Learning Engineer)

At the point when MapReduce was first published it noted the use of data locality mechanism for avoiding network transfer and therefore avoid a possible bottleneck. In their paper they were using a network bandwith of 100 MBit/s, which where totally common at this time, but nowadays 10 GBit/s networks are cheap and widely available [Chap 3+3.4][1].

So the question is: Is avoiding network traffic especially in Big Data processing still needed?

Let’s assume following:
We have 4 physical servers, which on each HDFS Datanode has 5 physical 10rpm SAS disks. The manufacturer of the disks notes that this disks will provide a 237 MiB/s sustained transfer speed. In Big Data, generally sustained read performance of huge data is more important than fast disk access.

So following theoretical properties are given:

  • Physical Node Network Bandwith: 10 GBit/s = 1250 MB/s
  • Max Sustained Transfer Speed per Datanode: 237 MiB/s * 5 = 1185 MiB/s = 1242,56 MB/s
    ➡ 1250 MB/s per Node Network Bandwith > 1242,56 MB/s Max Sustained Transfer Speed per Datanode

So in this case the maximal possible sustained data transfer of all disks combined of a Datanode is less than the available network bandwith and therefore our limiting factor. And yes, big data network transfer isn’t the only data which will be transferred over a network but should be the largest share. Also at this point at TIKI, the HDFS Cluster and there Datanodes are oversized according to the performance. No data processing or any other machine learning job ever succeeded in bottlenecking or hitting the limit of the theoretical speed limit of the HDFS Datanodes. We noticed that we are heavily bottlenecked by algorithms and CPU power. As long as the HDFS performs fast enough and every Datanode is planned that the maximal network bandwith cap can never be achieved, the network bandwith as a resource will never be bottlenecked and data locality as a solution for this can be ignored.

To be clear, this only applies to node based data locality. When having a large cluster with several racks or maybe multiple connected data centers, the network connection in between them are often a magnitute slower than the switched connection within a rack. So avoiding transfering data over this connection will still be a huge improvement, but most distributed file systems have already an integrated concept and implementation for this.

So the last open question is: Would having node based data locality bring any performance improvements? Maybe the network stack will add some latency?

On a normal side-by-side Hadoop Cluster installation with Apache Spark data locality would work out of the box, but at TIKI we use Kubernetes based distribution solution with Apache Spark 3.0 where the HDFS is managed and installed manually as a Daemonset. Due to the fact that worker Pods, which executes the Spark job workload, gets different IP adresses than the HDFS Data Nodes the default IP based mapping wouldn’t work, so we had to implement a custom solution for this. This was done by my collegue Linus Meierhöfer and he explained the implementation here.

Experiment environment
In my experiment I wanted to compare two data processing scenarios:

  1. All data is fetched and processed locally on the same physical machine
  2. All data is transfered over a network interface to a second physical machine

To achieve this I duplicated my test data sets and replicated one of them on all physical machines and for the other test data set I set the replication factor to 1 and manually verified that all data blocks exists only on one specific machine.

Then I started my experiment spark workloads which just reads the data and processing the data with different generic ways:

  • do statistical aggregations (describe)
  • counting every rows existing in the dataset (rowcount)
  • filtering for a specific column and counting the distribution of its nominal values (distribution)

All this workloads were executed on the same dataset but with different formats and compressions. The dataset itself contains ~10ˆ9 records each consisting of 29 columns.

Each workload has been repeated 5 times itself and following metrics have been monitored:

  • Network traffic of the physical server interfaces
  • Disk read access of the HDFS physical diskds
  • Wall time of the workloads

The next graph shows the monitored network traffic and the HDFS disk node access. The red dotted vertical line separates the experiment with data locality (left half) and the experiment without (right half). Which shows that no data has been sent over the network and the data has been read from the Node 1.

The right half shows that in the experiment without data locality the data has been read from Node 2 and then transferred to Node 2 where it was processed by the spark job. This graph just shows that the data locality avoided the network traffic completely.


The next graph shows the measured wall time of the executed workloads for different datasets. Unfortunately it shows no tendency in favour of either with or without data locality. In the left plot data locality shows minor, almost negligible, reduced wall time. But the right plot shows also that a workload without data locality can be faster too. It must also be said, that the file format Apache Parquet together with Apache spark optimizes this basic workloads so good, that only the needed / processed data needs to be transferred over the network.

Summary (tl;dr)
Network bandwiths has increased a lot in the last two decades and if it can be guaranteed that a distributed file system like HDFS won’t exceed the theoretical transfer speed of the available network bandwith, data locality could be ignored in favour of complexity. Therefore a possible network bottleneck scenario doesn’t exist anymore, as it was in the past, and it has no measurable impact in reducing the wall time of data processing workloads. Also todays optimized file formats, like Apache Spark, and data processing frameworks like Apache Spark are doing a great job in avoiding data transfer all together.

Experiment source
I executed the experiment within a jupyter IPython environment and used Apache Spark for distribution. The attached Jupyter Notebook gives a glimpse on how it was done.

Data Locality Experiment Notebook

[1]: J. Dean und S. Ghemawat, „MapReduce: Simplified Data Processing on Large Clusters,“ in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004, S. 137–150.

Kontaktieren Sie mich gerne, falls Fragen zum Beitrag auftreten. Ich freue mich über Ihre Nachricht und den Austausch zu diesem Thema.