Counting the number of **Unicode characters** (code points) in a UTF-8 encoded C string requires walking the string and decoding its variable-length sequences. Unlike ASCII, a UTF-8 character occupies 1 to 4 bytes, so `strlen` (which counts bytes, not characters) gives the wrong answer for non-ASCII text. Here’s how to do it in C:

---

### 1. **Manual Implementation**

You can classify the lead byte of each sequence and count one code point per sequence:

```c
#include <stddef.h>
#include <stdio.h>

// Count UTF-8 code points by inspecting the lead byte of each sequence.
// Continuation bytes are not validated here (see section 3 for that).
size_t count_utf8_chars(const char *str) {
    const unsigned char *p = (const unsigned char *)str;
    size_t count = 0;

    while (*p) {
        size_t len;
        if ((*p & 0x80) == 0x00) {          // 1-byte sequence (0xxxxxxx)
            len = 1;
        } else if ((*p & 0xE0) == 0xC0) {   // 2-byte sequence (110xxxxx)
            len = 2;
        } else if ((*p & 0xF0) == 0xE0) {   // 3-byte sequence (1110xxxx)
            len = 3;
        } else if ((*p & 0xF8) == 0xF0) {   // 4-byte sequence (11110xxx)
            len = 4;
        } else {
            // Stray continuation byte or invalid lead byte: skip without counting
            p++;
            continue;
        }
        count++;
        // Advance one byte at a time so that a truncated sequence at the end
        // of the string cannot move the pointer past the NUL terminator.
        while (len-- > 0 && *p) {
            p++;
        }
    }
    return count;
}

int main(void) {
    const char *str = "Hello, 世界! 😊"; // UTF-8 encoded string
    size_t char_count = count_utf8_chars(str);
    printf("Character count: %zu\n", char_count); // Output: Character count: 12
    return 0;
}
```

**Explanation:**

- UTF-8 sequences start with specific bit patterns:
  - `0xxxxxxx`: 1-byte character (ASCII).
  - `110xxxxx`: 2-byte character.
  - `1110xxxx`: 3-byte character.
  - `11110xxx`: 4-byte character.
- The function checks the first byte of each sequence to determine its length and increments the character count accordingly.
- The masks are written in hexadecimal because `0b` binary literals are a compiler extension before C23.
- The bytes are read as `unsigned char` so the comparisons do not depend on whether plain `char` is signed.

---

### 2. **Using a Library (ICU or libunistring)**

For more robust handling (e.g., validation and edge cases), you can use a library like **ICU** or **libunistring**.

#### **ICU (International Components for Unicode)**

ICU is a widely used library for Unicode handling.
Here’s an example:

```c
#include <stdio.h>
#include <string.h>
#include <unicode/ucnv.h>
#include <unicode/utypes.h>

size_t count_utf8_chars_icu(const char *str) {
    UErrorCode status = U_ZERO_ERROR;
    UConverter *conv = ucnv_open("UTF-8", &status);
    if (U_FAILURE(status)) {
        fprintf(stderr, "Failed to open converter: %s\n", u_errorName(status));
        return 0;
    }

    const char *source = str;
    const char *sourceLimit = str + strlen(str);
    size_t count = 0;

    // Decode one code point per call; ICU advances `source` for us.
    while (source < sourceLimit) {
        UChar32 c = ucnv_getNextUChar(conv, &source, sourceLimit, &status);
        (void)c; // Only the count matters, not the decoded value
        if (U_FAILURE(status)) {
            fprintf(stderr, "Error decoding UTF-8: %s\n", u_errorName(status));
            break;
        }
        count++;
    }

    ucnv_close(conv);
    return count;
}

int main(void) {
    const char *str = "Hello, 世界! 😊";
    size_t char_count = count_utf8_chars_icu(str);
    printf("Character count: %zu\n", char_count); // Output: Character count: 12
    return 0;
}
```

**Steps to compile with ICU:**

1. Install ICU (e.g., `sudo apt install libicu-dev` on Ubuntu).
2. Compile with ICU:

```bash
gcc -o utf8_count_icu utf8_count_icu.c -licuuc
```

---

#### **libunistring**

A lighter-weight alternative for Unicode handling:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistr.h>

size_t count_utf8_chars_unistring(const char *str) {
    // u8_mbsnlen counts the characters in the first strlen(str) bytes
    return u8_mbsnlen((const uint8_t *)str, strlen(str));
}

int main(void) {
    const char *str = "Hello, 世界! 😊";
    size_t char_count = count_utf8_chars_unistring(str);
    printf("Character count: %zu\n", char_count); // Output: Character count: 12
    return 0;
}
```

**Steps to compile with libunistring:**

1. Install libunistring (e.g., `sudo apt install libunistring-dev` on Ubuntu).
2. Compile with libunistring:

```bash
gcc -o utf8_count_unistring utf8_count_unistring.c -lunistring
```

---

### 3. **Handling Invalid UTF-8**

If your input might contain invalid UTF-8 sequences, you should handle errors gracefully. For example:

- Skip invalid sequences.
- Replace invalid sequences with a placeholder (e.g., U+FFFD, `�`).
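The replacement strategy from the list above can be sketched as follows. This is a minimal illustration, not a full validator: the names `count_utf8_chars_replace` and `is_continuation` are hypothetical, the check is structural only (no overlong or surrogate detection), and it assumes one U+FFFD per offending byte. Real decoders may group malformed bytes differently, so counts for invalid input can vary between implementations.

```c
#include <stddef.h>

/* Returns nonzero if b is a UTF-8 continuation byte (10xxxxxx). */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/*
 * Counts code points in a NUL-terminated string. Every byte that cannot
 * start a sequence, or that starts one whose continuation bytes are missing,
 * is counted as a single replacement character (U+FFFD).
 * If `replaced` is non-NULL, it receives the number of replacements.
 */
size_t count_utf8_chars_replace(const char *str, size_t *replaced) {
    const unsigned char *p = (const unsigned char *)str;
    size_t count = 0;
    size_t bad = 0;

    while (*p) {
        size_t len;
        if (*p < 0x80)                len = 1; /* 0xxxxxxx */
        else if ((*p & 0xE0) == 0xC0) len = 2; /* 110xxxxx */
        else if ((*p & 0xF0) == 0xE0) len = 3; /* 1110xxxx */
        else if ((*p & 0xF8) == 0xF0) len = 4; /* 11110xxx */
        else                          len = 0; /* stray continuation or invalid lead */

        /* Check the expected continuation bytes. The loop stops at the first
           non-continuation byte, so it never reads past the NUL terminator. */
        size_t i = 1;
        while (i < len && is_continuation(p[i])) {
            i++;
        }

        if (len == 0 || i < len) {
            bad++;      /* this byte would become U+FFFD */
            count++;
            p++;        /* resynchronize at the next byte */
        } else {
            count++;
            p += len;
        }
    }

    if (replaced) {
        *replaced = bad;
    }
    return count;
}
```

For the three bytes `a`, `0x80`, `b`, this returns 3 and reports one replacement, since the stray continuation byte stands in for a single `�`.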
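Here’s an example of skipping invalid sequences:

```c
#include <stddef.h>

// Returns nonzero if c is a UTF-8 continuation byte (10xxxxxx).
static int is_cont(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

size_t count_utf8_chars_safe(const char *str) {
    const unsigned char *p = (const unsigned char *)str;
    size_t count = 0;

    while (*p) {
        if ((*p & 0x80) == 0x00) {
            // 1-byte sequence
            count++;
            p += 1;
        } else if ((*p & 0xE0) == 0xC0 && is_cont(p[1])) {
            // 2-byte sequence with a valid continuation byte
            count++;
            p += 2;
        } else if ((*p & 0xF0) == 0xE0 && is_cont(p[1]) && is_cont(p[2])) {
            // 3-byte sequence with valid continuation bytes
            count++;
            p += 3;
        } else if ((*p & 0xF8) == 0xF0 && is_cont(p[1]) && is_cont(p[2]) &&
                   is_cont(p[3])) {
            // 4-byte sequence with valid continuation bytes
            count++;
            p += 4;
        } else {
            // Invalid lead byte, stray continuation byte, or truncated
            // sequence: skip one byte without counting it
            p++;
        }
    }
    return count;
}
```

Because `&&` short-circuits, the continuation checks stop at the NUL terminator and never read past the end of the string. The function checks byte structure only: it does not reject overlong encodings, surrogate code points (U+D800 to U+DFFF), or values above U+10FFFF.

---

### Key Notes:

- **Code points vs. user-perceived characters:** These functions count **code points**, which may not match user-perceived characters (grapheme clusters). For example, `é` written as `e` followed by a combining acute accent (U+0301) is two code points, and many emoji sequences span several.
- **Performance vs. robustness:** The manual loops are a single pass with no dependencies, while libraries like ICU add a dependency but are built for correctness and edge cases.
- **Validation:** Always validate UTF-8 if the input is untrusted to avoid security issues, such as overlong encodings slipping past byte-level filters.

Let me know if you need further clarification! 😊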