Apache Flink Word Count from File using Java
This post guides you through building a simple Apache Flink batch application that reads a text file, splits lines into words, and counts the number of occurrences of each word.
🧰 Prerequisites
- Java 8 or 11 installed
- Apache Flink installed on Ubuntu (see the installation guide)
- Maven or Gradle for Java project setup
📦 Step 1: Maven Dependency
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.17.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>1.17.1</version>
    </dependency>
</dependencies>
📝 Step 2: Java Program for Word Count from File
File: FileWordCount.java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FileWordCount {

    public static void main(String[] args) throws Exception {
        // Set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Specify the input file path
        String inputPath = "src/main/resources/sample.txt";

        // Read the file
        DataSet<String> text = env.readTextFile(inputPath);

        // Perform word count
        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())
                .groupBy(0)
                .sum(1);

        // Print the result
        counts.print();
    }

    // Tokenizer class that splits lines into words
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.length() > 0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
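Note that newer Flink releases deprecate the DataSet API in favor of the DataStream API running in batch execution mode. The snippet below is a minimal sketch of the same job written that way; the class name StreamFileWordCount is just an example, it reuses the Tokenizer from above, and it assumes flink-streaming-java is available on the classpath (add it alongside the Step 1 dependencies if it is not pulled in transitively).

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamFileWordCount {
    public static void main(String[] args) throws Exception {
        // Streaming environment, switched to batch mode because the file input is bounded
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.readTextFile("src/main/resources/sample.txt")
                .flatMap(new FileWordCount.Tokenizer())   // reuse the Tokenizer from Step 2
                .keyBy(tuple -> tuple.f0)                 // group by the word
                .sum(1)                                   // sum the per-word counts
                .print();

        env.execute("File Word Count (DataStream)");
    }
}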
📂 Step 3: Add Sample Input File
Save a file named sample.txt in src/main/resources with the following content:
Hello Flink
Apache Flink is powerful
Flink processes streaming and batch data
🚀 Step 4: Run the Program
You can run this Java application directly from your IDE, or package it with Maven and submit it to a Flink cluster using:
./bin/flink run -c FileWordCount path-to-your-jar.jar
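The input path is hard-coded in the program above, which is convenient from the IDE but awkward when submitting the JAR to a cluster. As a variation, here is a minimal sketch that reads the path from the command line with Flink's ParameterTool; the class name FileWordCountArgs and the --input flag are example choices, and the Tokenizer from Step 2 is reused.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;

public class FileWordCountArgs {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Take the input path from --input, falling back to the bundled sample file
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String inputPath = params.get("input", "src/main/resources/sample.txt");

        DataSet<String> text = env.readTextFile(inputPath);

        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new FileWordCount.Tokenizer())
                .groupBy(0)
                .sum(1);

        counts.print();
    }
}

It could then be submitted with, for example:

./bin/flink run -c FileWordCountArgs path-to-your-jar.jar --input /path/to/your/file.txt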
✅ Sample Output (the exact order of lines may vary)
(flink,3)
(apache,1)
(is,1)
(powerful,1)
(hello,1)
(processes,1)
(streaming,1)
(and,1)
(batch,1)
(data,1)
🎯 Summary
This Flink batch job demonstrates how to process text files and perform basic transformations and aggregations using the DataSet API. You can further enhance it with filtering, sorting, and writing to files or databases.
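For example, to write the counts to disk instead of printing them, the end of the main method in FileWordCount could be changed as sketched below; the output path and delimiters are arbitrary choices, and note that a file sink in the DataSet API does not trigger execution by itself, so env.execute() is required.

// In FileWordCount.main(), replace counts.print() with a file sink:
counts.writeAsCsv("output/word-counts.csv", "\n", ",");

// Unlike print(), a sink does not start the job on its own in the DataSet API,
// so execution has to be triggered explicitly.
env.execute("File Word Count");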
📚 Also Read: Quiz on Apache Flink Basics