
Hadoop-3.4.1 using Docker 02

Description

We will run a single-node Hadoop 3.4.1 cluster inside a Docker container.

Objectives

  • Run a Hadoop 3.4.1 instance
  • Run a provided MapReduce example

Prerequisites

  • Ubuntu
  • Docker

Set up Cluster

  • Clone the repository
    git clone https://github.com/hibuz/hadoop-docker.git
  • Change into the repository directory
    cd hadoop-docker
  • Bring up the application with Docker Compose
    docker compose up hadoop-dev --no-build
  • Open a bash shell inside the running container
    docker exec -it hadoop bash
    Output
    hadoop@dfc392f64adc:~/hadoop-3.4.1$

    # dfc392f64adc - container id
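    The hostname shown in the prompt is the container's short ID. As a small sketch, it can be pulled out of the prompt string with shell parameter expansion (the prompt value below is the one captured above):

    ```shell
    # the prompt captured from the container shell
    prompt='hadoop@dfc392f64adc:~/hadoop-3.4.1$'

    host=${prompt#*@}     # drop the "hadoop@" user prefix
    host=${host%%:*}      # drop everything from the first ":" on
    echo "$host"          # the container's short ID
    ```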

Explore the Hadoop environment

  • List the contents of the hadoop-3.4.1 installation directory
    hadoop@dfc392f64adc:~/hadoop-3.4.1$ ls -1
    bin
    etc
    include
    lib
    libexec
    LICENSE-binary
    licenses-binary
    LICENSE.txt
    logs
    NOTICE-binary
    NOTICE.txt
    README.txt
    sbin
    share
  • $HADOOP_HOME
    hadoop@dfc392f64adc:~/hadoop-3.4.1$ echo $HADOOP_HOME
    /home/hadoop/hadoop-3.4.1
  • Configuration files (*.xml)

    hadoop@dfc392f64adc:~/hadoop-3.4.1$ ls $HADOOP_HOME/etc/hadoop/*.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/capacity-scheduler.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/httpfs-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/core-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/kms-acls.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hadoop-policy.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/kms-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hdfs-rbf-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/mapred-site.xml
    /home/hadoop/hadoop-3.4.1/etc/hadoop/hdfs-site.xml /home/hadoop/hadoop-3.4.1/etc/hadoop/yarn-site.xml
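    For orientation, core-site.xml is where the default filesystem is set. In a typical pseudo-distributed setup it looks roughly like the fragment below; the exact host and port used by this image may differ, so treat the values as assumptions:

    ```xml
    <configuration>
      <!-- fs.defaultFS tells clients which filesystem "hdfs dfs" talks to.
           hdfs://localhost:9000 is a common pseudo-distributed default. -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
    ```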
  • List the daemon start/stop scripts in sbin
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ ls
    distribute-exclude.sh kms.sh start-balancer.sh start-yarn.sh stop-dfs.sh yarn-daemon.sh
    FederationStateStore mr-jobhistory-daemon.sh start-dfs.cmd stop-all.cmd stop-secure-dns.sh yarn-daemons.sh
    hadoop-daemon.sh refresh-namenodes.sh start-dfs.sh stop-all.sh stop-yarn.cmd
    hadoop-daemons.sh start-all.cmd start-secure-dns.sh stop-balancer.sh stop-yarn.sh
    httpfs.sh start-all.sh start-yarn.cmd stop-dfs.cmd workers.sh
  • start dfs (the HDFS daemons were already started when the container came up, so start-dfs.sh only reports them as running)
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ start-dfs.sh
    Starting namenodes on [localhost]
    localhost: namenode is running as process 165. Stop it first and ensure /tmp/hadoop-hadoop-namenode.pid file is empty before retry.
    Starting datanodes
    localhost: datanode is running as process 391. Stop it first and ensure /tmp/hadoop-hadoop-datanode.pid file is empty before retry.
    Starting secondary namenodes [dfc392f64adc]
    dfc392f64adc: secondarynamenode is running as process 647. Stop it first and ensure /tmp/hadoop-hadoop-secondarynamenode.pid file is empty before retry.
  • start yarn
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ start-yarn.sh
    Starting resourcemanager
    Starting nodemanagers
  • jps (list the running Java daemons)
    hadoop@dfc392f64adc:~/hadoop-3.4.1/sbin$ jps
    1922 ResourceManager
    2277 NodeManager
    165 NameNode
    647 SecondaryNameNode
    391 DataNode
    2510 Jps
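    A quick way to confirm that all five daemons are up is to grep the jps output for each expected name. The sketch below runs against the output captured above; on the container you would capture jps itself with jps_out=$(jps):

    ```shell
    # jps output captured above
    jps_out='1922 ResourceManager
    2277 NodeManager
    165 NameNode
    647 SecondaryNameNode
    391 DataNode
    2510 Jps'

    # -w matches whole words, so "NameNode" does not
    # accidentally match "SecondaryNameNode"
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
      if printf '%s\n' "$jps_out" | grep -qw "$daemon"; then
        echo "$daemon up"
      else
        echo "$daemon MISSING"
      fi
    done
    ```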

Provided examples (selected programs)

  • Mapreduce
    hadoop@dfc392f64adc:~/hadoop-3.4.1/share/hadoop/mapreduce$ ls
    hadoop-mapreduce-client-app-3.4.1.jar hadoop-mapreduce-client-hs-3.4.1.jar hadoop-mapreduce-client-jobclient-3.4.1-tests.jar hadoop-mapreduce-client-uploader-3.4.1.jar sources
    hadoop-mapreduce-client-common-3.4.1.jar hadoop-mapreduce-client-hs-plugins-3.4.1.jar hadoop-mapreduce-client-nativetask-3.4.1.jar hadoop-mapreduce-examples-3.4.1.jar
    hadoop-mapreduce-client-core-3.4.1.jar hadoop-mapreduce-client-jobclient-3.4.1.jar hadoop-mapreduce-client-shuffle-3.4.1.jar jdiff
  • hadoop-mapreduce-examples-3.4.1.jar - running the jar with no arguments prints the valid program names:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
    dbcount: An example job that count the pageview counts from a database.
    distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.
    wordmean: A map/reduce program that counts the average length of the words in the input files.
    wordmedian: A map/reduce program that counts the median length of the words in the input files.
    wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
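  Before running wordcount on the cluster, its data flow (map, shuffle/sort, reduce) can be sketched with plain Unix pipes; the two-line sample input here is made up:

  ```shell
  printf 'hello hadoop\nhello docker\n' |
    tr -s '[:space:]' '\n' |   # map: emit one word per line
    grep -v '^$' |             # guard against empty records
    sort |                     # shuffle/sort: identical keys become adjacent
    uniq -c                    # reduce: sum the occurrences of each key
  ```

  "hello" appears twice in the sample, so uniq -c reports it with count 2; the real job performs the same grouping and summing, just distributed across mappers and reducers.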

Run hadoop-mapreduce-examples-3.4.1.jar wordcount

  • Make the HDFS directories
    hdfs dfs -mkdir -p /user/hadoop/input
  • Copy the input files
    hdfs dfs -put ./etc/hadoop/*.xml input
  • Verify
    hdfs dfs -ls /user/hadoop/input
    Found 10 items
    -rw-r--r-- 1 hadoop supergroup 9213 2025-04-25 10:49 /user/hadoop/input/capacity-scheduler.xml
    -rw-r--r-- 1 hadoop supergroup 856 2025-04-25 10:49 /user/hadoop/input/core-site.xml
    -rw-r--r-- 1 hadoop supergroup 14007 2025-04-25 10:49 /user/hadoop/input/hadoop-policy.xml
    -rw-r--r-- 1 hadoop supergroup 683 2025-04-25 10:49 /user/hadoop/input/hdfs-rbf-site.xml
    -rw-r--r-- 1 hadoop supergroup 840 2025-04-25 10:49 /user/hadoop/input/hdfs-site.xml
    -rw-r--r-- 1 hadoop supergroup 620 2025-04-25 10:49 /user/hadoop/input/httpfs-site.xml
    -rw-r--r-- 1 hadoop supergroup 3518 2025-04-25 10:49 /user/hadoop/input/kms-acls.xml
    -rw-r--r-- 1 hadoop supergroup 682 2025-04-25 10:49 /user/hadoop/input/kms-site.xml
    -rw-r--r-- 1 hadoop supergroup 836 2025-04-25 10:49 /user/hadoop/input/mapred-site.xml
    -rw-r--r-- 1 hadoop supergroup 990 2025-04-25 10:49 /user/hadoop/input/yarn-site.xml
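    As a sanity check, the "Found N items" header from hdfs dfs -ls should match the number of *.xml files that were put. The sketch below parses the header captured above; on the cluster you would capture it with hdfs dfs -ls /user/hadoop/input | head -1:

    ```shell
    # header line captured from the -ls output above
    ls_header='Found 10 items'

    expected=10                 # ten *.xml files were uploaded
    found=${ls_header#Found }   # strip the leading "Found "
    found=${found% items}       # strip the trailing " items"

    if [ "$found" -eq "$expected" ]; then
      echo "upload OK: $found files"
    else
      echo "upload MISMATCH: $found != $expected"
    fi
    ```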
  • Run Examples

    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.1.jar wordcount input output
    Output
    ...
    ...
    2025-04-25 10:59:39,423 INFO mapred.LocalJobRunner: Finishing task: attempt_local1918393452_0001_r_000000_0
    2025-04-25 10:59:39,423 INFO mapred.LocalJobRunner: reduce task executor complete.
    2025-04-25 10:59:39,940 INFO mapreduce.Job: Job job_local1918393452_0001 running in uber mode : false
    2025-04-25 10:59:39,942 INFO mapreduce.Job: map 100% reduce 100%
    2025-04-25 10:59:39,943 INFO mapreduce.Job: Job job_local1918393452_0001 completed successfully
    2025-04-25 10:59:39,964 INFO mapreduce.Job: Counters: 36
    File System Counters
    FILE: Number of bytes read=3214979
    FILE: Number of bytes written=11244233
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=307019
    HDFS: Number of bytes written=11086
    HDFS: Number of read operations=168
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=13
    HDFS: Number of bytes read erasure-coded=0
    Map-Reduce Framework
    Map input records=842
    Map output records=3551
    Map output bytes=43805
    Map output materialized bytes=23062
    Input split bytes=1199
    Combine input records=3551
    Combine output records=1341
    Reduce input groups=634
    Reduce shuffle bytes=23062
    Reduce input records=1341
    Reduce output records=634
    Spilled Records=2682
    Shuffled Maps =10
    Failed Shuffles=0
    Merged Map outputs=10
    GC time elapsed (ms)=15
    Total committed heap usage (bytes)=7495221248
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=32245
    File Output Format Counters
    Bytes Written=11086
  • See the results in the browser (screenshot: 02-hadoop-using-docker-01.png), or print them directly:
    hdfs dfs -cat output/part-r-00000

Tear down: stop the containers and remove the containers, networks, and volumes created by up.

docker compose down -v

Based on https://github.com/hibuz/hadoop-docker