
Hadoop-3.4.1 using Docker 02

Description

We will run a single-node Hadoop 3.4.1 cluster inside a Docker container.

Objectives

  • Run a Hadoop 3.4.1 instance
  • Run a provided MapReduce example

Prerequisites

  • Ubuntu
  • Docker

Set up Cluster

  • Clone the repository
    git clone https://github.com/hibuz/hadoop-docker.git
  • Change into the repository directory
    cd hadoop-docker
  • Bring up the application with Docker Compose
    docker compose up hadoop-dev --no-build
  • Open a bash shell inside the running container
    docker exec -it hadoop bash
    Output
    hadoop@dfc392f64adc:~/hadoop-3.4.1$

    # dfc392f64adc - container id
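    The hostname shown in the prompt is the container's short ID. As a small sketch, it can be pulled out of the prompt string with shell parameter expansion (the prompt value below is the one captured above):

    ```shell
    # the prompt captured from the container shell
    prompt='hadoop@dfc392f64adc:~/hadoop-3.4.1$'

    host=${prompt#*@}     # drop the "hadoop@" user prefix
    host=${host%%:*}      # drop everything from the first ":" on
    echo "$host"          # the container's short ID
    ```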

Explore the Hadoop environment

  • List the contents of the hadoop-3.4.1 installation directory
    hadoop@dfc392f64adc:~/hadoop-3.4.1$ ls -1
    bin
    etc
    include
    lib
    libexec
    LICENSE-binary
    licenses-binary
    LICENSE.txt
    logs
    NOTICE-binary
    NOTICE.txt
    README.txt
    sbin
    share
  • $HADOOP_HOME
    hadoop@dfc392f64adc:~/hadoop-3.4.1$ echo $HADOOP_HOME
    /home/hadoop/hadoop-3.4.1
  • Configuration files (*.xml)

    hadoop@dfc392f64adc:~/hadoop-3.4.1$ ls $HADOOP_HOME/etc/hadoop/*.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/capacity-scheduler.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/httpfs-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/core-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/kms-acls.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hadoop-policy.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/kms-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hdfs-rbf-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/mapred-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hdfs-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/yarn-site.xml
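    For orientation, core-site.xml is where the default filesystem is set. In a typical pseudo-distributed setup it looks roughly like the fragment below; the exact host and port used by this image may differ, so treat the values as assumptions:

    ```xml
    <configuration>
      <!-- fs.defaultFS tells clients which filesystem "hdfs dfs" talks to.
           hdfs://localhost:9000 is a common pseudo-distributed default. -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    ```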
  • List the daemon start/stop scripts in sbin
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ ls
    distribute-exclude.sh kms.sh start-balancer.sh start-yarn.sh stop-dfs.sh yarn-daemon.sh
    FederationStateStore mr-jobhistory-daemon.sh start-dfs.cmd stop-all.cmd stop-secure-dns.sh yarn-daemons.sh
    hadoop-daemon.sh refresh-namenodes.sh start-dfs.sh stop-all.sh stop-yarn.cmd
    hadoop-daemons.sh start-all.cmd start-secure-dns.sh stop-balancer.sh stop-yarn.sh
    httpfs.sh start-all.sh start-yarn.cmd stop-dfs.cmd workers.sh
  • start dfs (the HDFS daemons were already started when the container came up, so start-dfs.sh only reports them as running)
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ start-dfs.sh
    Starting namenodes on [localhost]
    localhost: namenode is running as process 165. Stop it first and ensure /tmp/hadoop-hadoop-namenode.pid file is empty before retry.
    Starting datanodes
    localhost: datanode is running as process 391. Stop it first and ensure /tmp/hadoop-hadoop-datanode.pid file is empty before retry.
    Starting secondary namenodes [dfc392f64adc]
    dfc392f64adc: secondarynamenode is running as process 647. Stop it first and ensure /tmp/hadoop-hadoop-secondarynamenode.pid file is empty before retry.
  • start yarn
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ start-yarn.sh
    Starting resourcemanager
    Starting nodemanagers
  • jps (list the running Java daemons)
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ jps
    1922 ResourceManager
    2277 NodeManager
    165 NameNode
    647 SecondaryNameNode
    391 DataNode
    2510 Jps
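    A quick way to confirm that all five daemons are up is to grep the jps output for each expected name. The sketch below runs against the output captured above; on the container you would capture jps itself with jps_out=$(jps):

    ```shell
    # jps output captured above
    jps_out='1922 ResourceManager
    2277 NodeManager
    165 NameNode
    647 SecondaryNameNode
    391 DataNode
    2510 Jps'

    # -w matches whole words, so "NameNode" does not
    # accidentally match "SecondaryNameNode"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
      if printf '%s\n' "$jps_out" | grep -qw "$daemon"; then
        echo "$daemon up"
      else
        echo "$daemon MISSING"
      fi
    done
    ```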

Provided examples (selected programs)

  • Mapreduce
    hadoop@dfc392f64adc:~/hadoop-3.4.1/share/hadoop/mapreduce$ ls
    hadoop-mapreduce-client-app-3.4.1.jar hadoop-mapreduce-client-hs-3.4.1.jar hadoop-mapreduce-client-jobclient-3.4.1-tests.jar hadoop-mapreduce-client-uploader-3.4.1.jar sources
    hadoop-mapreduce-client-common-3.4.1.jar hadoop-mapreduce-client-hs-plugins-3.4.1.jar hadoop-mapreduce-client-nativetask-3.4.1.jar hadoop-mapreduce-examples-3.4.1.jar
    hadoop-mapreduce-client-core-3.4.1.jar hadoop-mapreduce-client-jobclient-3.4.1.jar hadoop-mapreduce-client-shuffle-3.4.1.jar jdiff
  • hadoop-mapreduce-examples-3.4.1.jar - running the jar with no arguments prints the valid program names:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
    dbcount: An example job that count the pageview counts from a database.
    distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
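  Before running wordcount on the cluster, its data flow (map, shuffle/sort, reduce) can be sketched with plain Unix pipes; the two-line sample input here is made up:

  ```shell
  printf 'hello hadoop\nhello docker\n' |
    tr -s '[:space:]' '\n' |   # map: emit one word per line
    grep -v '^$' |             # guard against empty records
    sort |                     # shuffle/sort: identical keys become adjacent
    uniq -c                    # reduce: sum the occurrences of each key
  ```

  "hello" appears twice in the sample, so uniq -c reports it with count 2; the real job performs the same grouping and summing, just distributed across mappers and reducers.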

Run hadoop-mapreduce-examples-3.4.1.jar wordcount

  • Make the HDFS directories
    hdfs dfs -mkdir -p /user/hadoop/input
  • Copy the input files
    hdfs dfs -put ./etc/hadoop/*.xml input
  • Verify
    hdfs dfs -ls /user/hadoop/input
    Found 10 items
    -rw-r--r-- 1 hadoop supergroup 9213 2025-04-25 10:49 /user/hadoop/input/capacity-scheduler.xml
    -rw-r--r-- 1 hadoop supergroup 856 2025-04-25 10:49 /user/hadoop/input/core-site.xml
    -rw-r--r-- 1 hadoop supergroup 14007 2025-04-25 10:49 /user/hadoop/input/hadoop-policy.xml
    -rw-r--r-- 1 hadoop supergroup 683 2025-04-25 10:49 /user/hadoop/input/hdfs-rbf-site.xml
    -rw-r--r-- 1 hadoop supergroup 840 2025-04-25 10:49 /user/hadoop/input/hdfs-site.xml
    -rw-r--r-- 1 hadoop supergroup 620 2025-04-25 10:49 /user/hadoop/input/httpfs-site.xml
    -rw-r--r-- 1 hadoop supergroup 3518 2025-04-25 10:49 /user/hadoop/input/kms-acls.xml
    -rw-r--r-- 1 hadoop supergroup 682 2025-04-25 10:49 /user/hadoop/input/kms-site.xml
    -rw-r--r-- 1 hadoop supergroup 836 2025-04-25 10:49 /user/hadoop/input/mapred-site.xml
    -rw-r--r-- 1 hadoop supergroup 990 2025-04-25 10:49 /user/hadoop/input/yarn-site.xml
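    As a sanity check, the "Found N items" header from hdfs dfs -ls should match the number of *.xml files that were put. The sketch below parses the header captured above; on the cluster you would capture it with hdfs dfs -ls /user/hadoop/input | head -1:

    ```shell
    # header line captured from the -ls output above
    ls_header='Found 10 items'

    expected=10                 # ten *.xml files were uploaded
    found=${ls_header#Found }   # strip the leading "Found "
    found=${found% items}       # strip the trailing " items"

    if [ "$found" -eq "$expected" ]; then
      echo "upload OK: $found files"
    else
      echo "upload MISMATCH: $found != $expected"
    fi
    ```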
  • Run Examples

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.1.jar wordcount input output
    Output
    ...
    ...
    2025-04-25 10:59:39,423 INFO mapred.LocalJobRunner: Finishing task: attempt_local1918393452_0001_r_000000_0
    2025-04-25 10:59:39,423 INFO mapred.LocalJobRunner: reduce task executor complete.
    2025-04-25 10:59:39,940 INFO mapreduce.Job: Job job_local1918393452_0001 running in uber mode : false
    2025-04-25 10:59:39,942 INFO mapreduce.Job: map 100% reduce 100%
    2025-04-25 10:59:39,943 INFO mapreduce.Job: Job job_local1918393452_0001 completed successfully
    2025-04-25 10:59:39,964 INFO mapreduce.Job: Counters: 36
    File System Counters
    FILE: Number of bytes read=3214979
    FILE: Number of bytes written=11244233
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=307019
    HDFS: Number of bytes written=11086
    HDFS: Number of read operations=168
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=13
    HDFS: Number of bytes read erasure-coded=0
    Map-Reduce Framework
    Map input records=842
    Map output records=3551
    Map output bytes=43805
    Map output materialized bytes=23062
    Input split bytes=1199
    Combine input records=3551
    Combine output records=1341
    Reduce input groups=634
    Reduce shuffle bytes=23062
    Reduce input records=1341
    Reduce output records=634
    Spilled Records=2682
    Shuffled Maps =10
    Failed Shuffles=0
    Merged Map outputs=10
    GC time elapsed (ms)=15
    Total committed heap usage (bytes)=7495221248
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=32245
    File Output Format Counters
    Bytes Written=11086
  • See the results in the browser (screenshot: 02-hadoop-using-docker-01.png), or print them directly:
    hdfs dfs -cat output/part-r-00000

Tear down: stop the containers and remove the containers, networks, and volumes created by up.

docker compose down -v

Based on https://github.com/hibuz/hadoop-docker