A tokenize string function in C++ is a custom function or implementation that splits a string into smaller parts called “tokens” based on specified delimiters (e.g., commas, spaces). It is commonly used to process text data, extract meaningful parts, or handle formatted inputs.
How to tokenize a string in C++
Tokenization breaks a string into smaller parts, like splitting a sentence into words, so you can work with each part individually.
What is a token in C++?
A token in C++ refers to a substring or component extracted from a larger string, separated by specific delimiters. For example, in the string "C++ is fun", the tokens are "C++", "is", and "fun" if the delimiter is a space.
Why do we need tokenization in C++?
Tokenization is useful for parsing and processing data efficiently. Here are some common use cases:
Processing input: Breaking down user input into meaningful segments.
Data analysis: Parsing CSV or log files to extract data fields.
Text manipulation: Splitting sentences into words for linguistic processing.
Command handling: Extracting arguments and commands from input strings in console applications.
Now that we understand why tokenization is useful, let’s look at how it works in C++.
Steps to tokenize a string in C++
We can create a stringstream object or use the standard library strtok() function to tokenize a string. To see what these tools do under the hood, the steps below describe a hand-written tokenizer.
Follow the steps below to tokenize a string:
- Read the complete string.
- Select the delimiter, the character at which you want to split the string. In this example, we will split the string at every space.
- Iterate over all the characters of the string and check whether each one is the delimiter.
- If the delimiter is found, push the word collected so far to a vector.
- Repeat this process until you have traversed the complete string.
- The last token is not followed by a delimiter, so push it to the vector after the loop.
- Finally, return the vector of tokens.
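The steps above can be sketched as a small hand-written tokenizer. This is a minimal illustration rather than a definitive implementation: the function name `tokenize`, its signature, and the choice to skip empty tokens when delimiters repeat are all assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (name and signature are assumptions for illustration).
// Splits `text` on a single delimiter character, following the steps above.
std::vector<std::string> tokenize(const std::string& text, char delimiter = ' ') {
    std::vector<std::string> tokens;
    std::string current;  // Characters of the token currently being built

    for (char ch : text) {
        if (ch == delimiter) {
            // Delimiter found: store the word collected so far.
            // Skipping empty words avoids empty tokens for repeated delimiters.
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
        } else {
            current += ch;
        }
    }

    // The last token is not followed by a delimiter, so push it after the loop.
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

Calling `tokenize("C++ is fun")` returns a vector holding "C++", "is", and "fun". Returning a vector instead of printing lets the caller decide what to do with each token.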
Note:
strtok modifies the input string by replacing delimiters with null characters, so it is better suited for temporary strings or copies rather than original data.
Now let’s look at the code for clarity:
Coding example
Here’s a simple example demonstrating string tokenization using std::strtok:
#include <iostream>
#include <cstring> // For strtok
#include <string>  // For std::string
void tokenizeString(const std::string& str, const char* delimiter) {
    char tempStr[100]; // Assumes the input is shorter than 100 characters
    std::strcpy(tempStr, str.c_str()); // Convert std::string to char array

    char* token = std::strtok(tempStr, delimiter);
    while (token != nullptr) {
        std::cout << "Token: " << token << std::endl;
        token = std::strtok(nullptr, delimiter); // Get the next token
    }
}

int main() {
    std::string input = "C++ programming; tokenization, example; ; ";
    const char* delimiter = ", "; // Delimiters are comma and space

    std::cout << "Original String: " << input << std::endl;
    std::cout << "Tokens:" << std::endl;
    tokenizeString(input, delimiter);

    return 0;
}
Code explanation
Line 1–2: Include the necessary libraries: iostream for output and cstring for string manipulation (strtok).
Line 4: Define the tokenizeString function, which takes a string and a delimiter for tokenizing the input string.
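Line 5: Declare a fixed-size character array tempStr to hold a mutable copy of the input string. Because the buffer holds 100 characters, an input of 100 or more characters would overflow it.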
Line 6: Copy the content of the input string str into tempStr using std::strcpy.
Line 8: Tokenize tempStr using std::strtok and the provided delimiter. The first token is returned and stored in token.
Line 9–12: Loop through the tokens while token is not null. Print each token to the console. Inside the loop, get the next token by calling std::strtok again with nullptr and the same delimiter.
Line 15–17: Start the main function, define the input string, and set delimiters to comma and space.
Line 21: Call tokenizeString to tokenize the input string and print the tokens.
Modify the above code to handle multiple delimiters, such as a comma (,), semicolon (;), and space ( ). Make sure no empty tokens are printed.
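One possible approach to this exercise, sketched here without strtok, is to search the string with `std::string::find_first_not_of` and `std::string::find_first_of`. The helper name `splitOnAny` and its default delimiter set are assumptions made for this example, not part of the code above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: any character in `delimiters` ends a token.
std::vector<std::string> splitOnAny(const std::string& text,
                                    const std::string& delimiters = ",; ") {
    std::vector<std::string> tokens;

    // Skip any leading delimiters to find the start of the first token.
    std::size_t start = text.find_first_not_of(delimiters);

    while (start != std::string::npos) {
        // The token ends at the next delimiter (or at the end of the string).
        std::size_t end = text.find_first_of(delimiters, start);

        // substr clamps the length, so end == npos takes the rest of the string.
        tokens.push_back(text.substr(start, end - start));

        // Skip the whole run of delimiters, so no empty tokens are produced.
        start = text.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

For the input "C++ programming; tokenization, example; ; ", this returns "C++", "programming", "tokenization", and "example". Unlike the strtok version, it leaves the input string unchanged and needs no fixed-size buffer.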
Common methods for tokenization in C++
Here’s a table summarizing common methods for tokenization in C++:
Method Name | Description | Use cases |
std::strtok | Splits a C-style string into tokens based on specified delimiters. | Suitable for simple tokenization tasks with temporary or copied strings. Not thread-safe. |
std::getline | Extracts tokens from a stream until a specified delimiter is encountered. | Ideal for reading tokens from input streams such as files. |
std::istringstream | Treats a string as a stream, allowing easy token extraction using the >> operator. | Useful for tokenizing strings with well-defined formats, such as space-separated values. |
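The two stream-based methods in the table can be sketched as follows. Both helper names, `splitWhitespace` and `splitWithGetline`, are hypothetical and exist only for this illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split on whitespace with the >> operator.
std::vector<std::string> splitWhitespace(const std::string& text) {
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string word;
    while (stream >> word) {  // >> skips runs of whitespace between words
        tokens.push_back(word);
    }
    return tokens;
}

// Hypothetical helper: split on one delimiter character with std::getline.
// Unlike >>, std::getline yields an empty token between two adjacent delimiters.
std::vector<std::string> splitWithGetline(const std::string& text, char delimiter) {
    std::istringstream stream(text);
    std::vector<std::string> tokens;
    std::string field;
    while (std::getline(stream, field, delimiter)) {
        tokens.push_back(field);
    }
    return tokens;
}
```

The difference matters for formats like CSV: `splitWithGetline("a,,b", ',')` keeps the empty middle field, while `splitWhitespace` never produces an empty token.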
Key takeaways
Tokenization helps break down strings into meaningful segments for easy processing.
In C++, you can use methods like std::strtok, std::istringstream, or std::getline for tokenization.
Always consider edge cases like multiple consecutive delimiters or empty strings while tokenizing.
Tokenization is a fundamental skill useful in data parsing, text analysis, and many real-world applications.