Sunday, May 18, 2025

Apache Flink Word Count from File using Java

This post guides you through building a simple Apache Flink batch application that reads a text file, splits lines into words, and counts the number of occurrences of each word.

🧰 Prerequisites

  • Java 8 or 11 installed
  • Apache Flink installed on Ubuntu (View installation guide)
  • Maven or Gradle for Java project setup

📦 Step 1: Maven Dependency

<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.17.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>1.17.1</version>
  </dependency>
</dependencies>

📝 Step 2: Java Program for Word Count from File

File: FileWordCount.java


import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FileWordCount {

    public static void main(String[] args) throws Exception {
        // Set up the batch execution environment
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Specify the input file path
        String inputPath = "src/main/resources/sample.txt";

        // Read the file line by line
        DataSet<String> text = env.readTextFile(inputPath);

        // Perform word count: tokenize, group by word, sum the counts
        DataSet<Tuple2<String, Integer>> counts = text
            .flatMap(new Tokenizer())
            .groupBy(0)
            .sum(1);

        // Print the result
        counts.print();
    }

    // Tokenizer: splits each line into lowercase words and emits (word, 1) pairs
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (word.length() > 0) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}
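Since the splitting rule inside Tokenizer is just a regex, you can sanity-check it in plain Java without a Flink runtime. The TokenizerDemo class below is my own illustrative stand-in, not part of the Flink job:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone check of the same tokenization rule the Tokenizer uses:
// lowercase the line, split on runs of non-word characters, drop empties.
public class TokenizerDemo {

    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints the lowercase words with punctuation stripped
        System.out.println(tokenize("Hello Flink! Apache Flink..."));
    }
}
```

Running this prints [hello, flink, apache, flink], confirming that punctuation is discarded and case is normalized before counting.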

📂 Step 3: Add Sample Input File

Save a file named sample.txt in src/main/resources with the following content:

Hello Flink Apache Flink is powerful Flink processes streaming and batch data
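
On a Unix-like shell you can create the file from the project root like this (the path matches the inputPath used in the program above):

```shell
# Create the resources directory and write the sample line into it
mkdir -p src/main/resources
printf '%s\n' "Hello Flink Apache Flink is powerful Flink processes streaming and batch data" \
  > src/main/resources/sample.txt
```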

🚀 Step 4: Run the Program

You can run this Java application directly from your IDE, or package it with Maven and submit it to a local Flink cluster using:

./bin/flink run -c FileWordCount path-to-your-jar.jar

✅ Sample Output

(flink,3)
(apache,1)
(is,1)
(powerful,1)
(hello,1)
(processes,1)
(streaming,1)
(and,1)
(batch,1)
(data,1)

Note that the order of the tuples may vary between runs.

🎯 Summary

This Flink batch job demonstrates how to process text files and perform basic transformations and aggregations using the DataSet API. You can further enhance it with filtering, sorting, and writing to files or databases. Note that the DataSet API is considered legacy in recent Flink releases; newer applications typically use the DataStream API in batch execution mode instead.
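
For intuition, the grouping and summing that Flink distributes across the cluster can be mimicked in plain Java with streams. This sketch is my own single-JVM illustration of the same flatMap → groupBy → sum pipeline, not Flink API:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Local, single-process analogue of flatMap -> groupBy(0) -> sum(1):
// split into words, group identical words, and count each group.
public class LocalWordCount {

    static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("Hello Flink Apache Flink is powerful Flink"));
    }
}
```

The difference, of course, is that Flink performs the same logical steps in parallel over partitions of the input, which is what makes it scale beyond a single machine.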


📚 Also Read: Quiz on Apache Flink Basics
