Apache Flink Word Count from File using Java
This post guides you through building a simple Apache Flink batch application that reads a text file, splits lines into words, and counts the number of occurrences of each word.
🧰 Prerequisites
- Java 8 or 11 installed
- Apache Flink installed on Ubuntu (see the installation guide)
- Maven or Gradle for Java project setup
📦 Step 1: Maven Dependency
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.17.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>1.17.1</version>
    </dependency>
</dependencies>
📝 Step 2: Java Program for Word Count from File
File: FileWordCount.java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FileWordCount {

    public static void main(String[] args) throws Exception {
        // Set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Specify the input file path
        String inputPath = "src/main/resources/sample.txt";

        // Read the file
        DataSet<String> text = env.readTextFile(inputPath);

        // Perform word count
        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())
                .groupBy(0)
                .sum(1);

        // Print the result
        counts.print();
    }

    // Tokenizer class that splits lines into words
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.length() > 0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
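Note that newer Flink releases deprecate the DataSet API in favor of the DataStream API running in batch execution mode. The snippet below is a minimal sketch of the same job written that way; the class name StreamFileWordCount is just an example, it reuses the Tokenizer from above, and it assumes flink-streaming-java is available on the classpath (add it alongside the Step 1 dependencies if it is not pulled in transitively).

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamFileWordCount {
    public static void main(String[] args) throws Exception {
        // Streaming environment, switched to batch mode because the file input is bounded
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.readTextFile("src/main/resources/sample.txt")
                .flatMap(new FileWordCount.Tokenizer())   // reuse the Tokenizer from Step 2
                .keyBy(tuple -> tuple.f0)                 // group by the word
                .sum(1)                                   // sum the per-word counts
                .print();

        env.execute("File Word Count (DataStream)");
    }
}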
📂 Step 3: Add Sample Input File
Save a file named sample.txt in src/main/resources with the following content:
Hello Flink
Apache Flink is powerful
Flink processes streaming and batch data
🚀 Step 4: Run the Program
You can run this Java application directly from your IDE, or package it with Maven and submit it to a Flink cluster using:
./bin/flink run -c FileWordCount path-to-your-jar.jar
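The input path is hard-coded in the program above, which is convenient from the IDE but awkward when submitting the JAR to a cluster. As a variation, here is a minimal sketch that reads the path from the command line with Flink's ParameterTool; the class name FileWordCountArgs and the --input flag are example choices, and the Tokenizer from Step 2 is reused.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;

public class FileWordCountArgs {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Take the input path from --input, falling back to the bundled sample file
        final ParameterTool params = ParameterTool.fromArgs(args);
        final String inputPath = params.get("input", "src/main/resources/sample.txt");

        DataSet<String> text = env.readTextFile(inputPath);

        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new FileWordCount.Tokenizer())
                .groupBy(0)
                .sum(1);

        counts.print();
    }
}

It could then be submitted with, for example:

./bin/flink run -c FileWordCountArgs path-to-your-jar.jar --input /path/to/your/file.txt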
✅ Sample Output (the exact order of lines may vary)
(flink,3)
(apache,1)
(is,1)
(powerful,1)
(hello,1)
(processes,1)
(streaming,1)
(and,1)
(batch,1)
(data,1)
🎯 Summary
This Flink batch job demonstrates how to process text files and perform basic transformations and aggregations using the DataSet API. You can further enhance it with filtering, sorting, and writing to files or databases.
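For example, to write the counts to disk instead of printing them, the end of the main method in FileWordCount could be changed as sketched below; the output path and delimiters are arbitrary choices, and note that a file sink in the DataSet API does not trigger execution by itself, so env.execute() is required.

// In FileWordCount.main(), replace counts.print() with a file sink:
counts.writeAsCsv("output/word-counts.csv", "\n", ",");

// Unlike print(), a sink does not start the job on its own in the DataSet API,
// so execution has to be triggered explicitly.
env.execute("File Word Count");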
📚 Also Read: Quiz on Apache Flink Basics