How to generate an n-gram in Java

An n-gram is a contiguous sequence of n characters within a given piece of text. For example, the 2-grams of "abc" are "ab" and "bc".

Generating a list of n-grams

For a given piece of text, we can compute a list of all its n-grams. The size of the n-grams (n) must be passed as an argument.

Let’s make a list of n-grams from a string:

import java.util.*;

class Ngrams {
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        // Slide a window of size n over the string, one step per iteration
        for (int i = 0; i < str.length() - n + 1; i++) {
            // Add the substring of size n that starts at index i
            ngrams.add(str.substring(i, i + n));
        }
        return ngrams;
    }

    public static void main(String[] args) {
        String s = "abcdef";
        List<String> ngrams = ngrams(3, s);
        // Prints abc, bcd, cde, def (one per line)
        for (String ngram : ngrams) {
            System.out.println(ngram);
        }
    }
}

All you have to do is iterate through the string with a fixed window of size n. In each iteration, the new substring, or n-gram, will be added to the ngrams list.
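The loop bound also tells us how many n-grams to expect: a string of length L yields L - n + 1 of them, and none at all when n is larger than L. The following is a minimal sketch of the same sliding window written with streams. The class name NgramSketch and the choice to reject non-positive values of n are assumptions made for illustration, not part of the listing above.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class NgramSketch {
    // Hypothetical helper: same sliding window, with an explicit check on n
    static List<String> ngrams(int n, String str) {
        if (n <= 0) {
            // Assumption: a non-positive n is treated as a caller error
            throw new IllegalArgumentException("n must be positive");
        }
        // The window can start at indices 0 .. length - n;
        // the range is empty when n is larger than the string
        return IntStream.rangeClosed(0, str.length() - n)
                .mapToObj(i -> str.substring(i, i + n))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(ngrams(3, "abcdef")); // [abc, bcd, cde, def]
        System.out.println(ngrams(7, "abcdef")); // []
    }
}
```

Here "abcdef" has length 6, so n = 3 gives 6 - 3 + 1 = 4 n-grams, while n = 7 gives an empty list.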

Creating an iterator

A lazy alternative is to create an iterator over the text. Instead of building the whole list up front, each call to next() returns the current n-gram and moves the window one character forward.

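import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {
    private final String str;
    private final int n;
    private int pos = 0; // start index of the next n-gram

    public NgramIterator(int n, String str) {
        this.n = n;
        this.str = str;
    }

    @Override
    public boolean hasNext() {
        // True while a full window of size n still fits in the string
        return pos < str.length() - n + 1;
    }

    @Override
    public String next() {
        // The Iterator contract requires this exception once the elements run out
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Return the current n-gram, then move the window one step forward
        return str.substring(pos, pos++ + n);
    }

    public static void main(String[] args) {
        String s = "abcdef";
        new NgramIterator(3, s).forEachRemaining(System.out::println);
    }
}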

Once again, n is the size of the n-grams, and pos is the start index of the next n-gram. It begins at 0, so the iterator starts at the beginning of the string, and each call to next() advances it by one.
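An iterator can only be walked once, and it cannot be used directly in a for-each loop. One possible way around both limits is to wrap the same logic in an Iterable that hands out a fresh iterator on every call. The sketch below assumes a hypothetical class name, NgramIterable, and is only one way to arrange this.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterable implements Iterable<String> {
    private final int n;
    private final String str;

    NgramIterable(int n, String str) {
        this.n = n;
        this.str = str;
    }

    @Override
    public Iterator<String> iterator() {
        // Each call returns a new iterator that starts at index 0
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                // A full window of size n must still fit in the string
                return pos <= str.length() - n;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String gram = str.substring(pos, pos + n);
                pos++;
                return gram;
            }
        };
    }

    public static void main(String[] args) {
        // Prints ab, bc, cd (one per line)
        for (String gram : new NgramIterable(2, "abcd")) {
            System.out.println(gram);
        }
    }
}
```

Because iterator() builds a new object each time, the same NgramIterable can be looped over more than once, and the n-grams are still produced one at a time.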

Copyright ©2024 Educative, Inc. All rights reserved