Akinix
Gizmo

Regex to match all letters, numbers, and spaces (In all languages, Unicode!)


Hello, I'm looking for a solution. I need a regex that matches all letters, numbers, and spaces, not just in English but in any language, across the full range of Unicode characters.

The popular example /^[a-zA-Z0-9\s]*$/ is not suitable as it only allows English letters.

/^[\p{L}\p{N}\p{Zs}]+$/gmu

This regular expression matches any string that consists only of letters, numbers, and space characters. Let's break down the pattern:

  • ^: asserts the start of the line.
  • [\p{L}\p{N}\p{Zs}]+: matches one or more characters from any of the following Unicode character classes:
    • \p{L}: any kind of letter from any language.
    • \p{N}: any kind of numeric character in any script.
    • \p{Zs}: a space separator, such as the ordinary space or the no-break space. It does not match tabs or line breaks.
  • $: asserts the end of the line.
  • /g: the global flag, which makes the regex find every match in the input string rather than stopping after the first one.
  • /m: the multiline flag, which makes ^ and $ match at the start and end of each line rather than only at the start and end of the whole input string.
  • /u: the Unicode flag, which enables Unicode mode. In JavaScript, \p{...} property escapes only work when this flag is set.

This regular expression can be used, for example, to check if a given string consists only of alphanumeric characters and spaces, with no special characters or punctuation.
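
As a rough illustration of the check described above, here is a minimal JavaScript sketch. It assumes a JavaScript engine with Unicode property escapes (any recent Node.js or browser); the helper name isLettersDigitsSpaces is made up for this example. For a yes/no check on a single string the g and m flags are dropped, because g makes test() remember its position between calls. The last lines show what the original gmu version does on multi-line input.

```javascript
// Same character class as the answer, without g and m: a plain yes/no check.
const lettersDigitsSpaces = /^[\p{L}\p{N}\p{Zs}]+$/u;

// Hypothetical helper name, used only for this illustration.
function isLettersDigitsSpaces(text) {
  return lettersDigitsSpaces.test(text);
}

console.log(isLettersDigitsSpaces("Hello world 123")); // true
console.log(isLettersDigitsSpaces("Привет мир 42"));   // true
console.log(isLettersDigitsSpaces("hello, world"));    // false: the comma is punctuation
console.log(isLettersDigitsSpaces("tab\tseparated"));  // false: a tab is not in \p{Zs}
console.log(isLettersDigitsSpaces(""));                // false: + needs at least one character

// With g and m, match() returns every line that passes the check on its own.
const perLine = /^[\p{L}\p{N}\p{Zs}]+$/gmu;
const multiLineInput = "first line\nsecond, line\nтретья строка";
const matchingLines = multiLineInput.match(perLine);
console.log(matchingLines); // [ 'first line', 'третья строка' ]
```

If you keep the g flag and call test() repeatedly on the same regex object, reset lastIndex to 0 between calls or create a fresh regex each time.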

Edited by Everlasting Summer

If you plan to use this as a nickname matcher, I recommend considering the following:

/^[\p{L}\p{Mc}\p{Mn}\p{Nd}\p{P}\p{Zs}]+$/gmu
  1. This is stricter about numbers: it only allows decimal digits (\p{Nd}) and rejects uncommon numeric characters like these:

    𒐫 (Cuneiform Numeric Sign Nine Shar2)

    𒐹 (Cuneiform Numeric Sign Five Buru)

    These characters are quite large and may have unpleasant effects on the layout and readability of nicknames.
  2. Additionally, it allows combining marks (\p{Mc} and \p{Mn}), which some languages need in order to write their letters.
  3. It also allows punctuation characters, as some users may need them in their names.

How it works

  • \p{L}: Matches any Unicode letter.
  • \p{Mc}: Matches a spacing combining mark (a character intended to be combined with another character). Explained below.
  • \p{Mn}: Matches a non-spacing mark (a character intended to be combined with another character without taking up extra space). Explained below.
  • \p{Nd}: Matches a decimal digit.
  • \p{P}: Matches any kind of punctuation character.
  • \p{Zs}: Matches a space separator (most common form being a space character).
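
To see the difference between \p{N} and \p{Nd} in practice, here is a small JavaScript sketch (assuming an engine that supports Unicode property escapes). The variable names and the sample nicknames are invented for illustration; the two cuneiform signs are the ones quoted above.

```javascript
// \p{N} covers every numeric category (Nd, Nl, No); \p{Nd} is decimal digits only.
const anyNumber = /^\p{N}+$/u;
const decimalDigits = /^\p{Nd}+$/u;

// Western digits, Arabic-Indic digits (U+0664 U+0662), the two cuneiform
// signs quoted above, and a Roman numeral character (U+2167).
const numberSamples = ["42", "\u0664\u0662", "𒐫", "𒐹", "\u2167"];
for (const sample of numberSamples) {
  // Prints the sample, then whether \p{N} and \p{Nd} accept it.
  console.log(sample, anyNumber.test(sample), decimalDigits.test(sample));
}

// The nickname pattern from this post, without g and m (single yes/no check).
const nickname = /^[\p{L}\p{Mc}\p{Mn}\p{Nd}\p{P}\p{Zs}]+$/u;

console.log(nickname.test("Jean-Luc O'Brien 2")); // true: hyphen and apostrophe are punctuation
console.log(nickname.test("user_name"));          // true: underscore is punctuation
console.log(nickname.test("player 𒐫"));           // false: the sign is a number, but not a decimal digit
console.log(nickname.test("a+b"));                // false: + is a symbol (\p{S}), not punctuation
```

Note that \p{P} is quite permissive. If you only want a few punctuation characters in nicknames, listing them explicitly in the character class may be the safer choice.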

Why use \p{Mc} and \p{Mn}

In many cases, letters with diacritical marks can be represented not just as a single Unicode symbol, but as a combination of a base symbol and a combining mark. These letters are very common in languages such as German, Norwegian, Turkish, and many others.
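
A short JavaScript sketch of the two representations, assuming an engine with Unicode property escapes. The variable names are made up for this example. It shows that the same visible letter can fail a letters-only pattern when it is stored as base letter plus combining mark.

```javascript
const lettersOnly = /^\p{L}+$/u;
const lettersAndMarks = /^[\p{L}\p{Mc}\p{Mn}]+$/u;

const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT (a non-spacing mark)

console.log(precomposed === decomposed);                  // false: different code points
console.log(precomposed === decomposed.normalize("NFC")); // true: NFC composes the pair
console.log(precomposed.normalize("NFD") === decomposed); // true: NFD splits it again

console.log(lettersOnly.test(precomposed));    // true
console.log(lettersOnly.test(decomposed));     // false: U+0301 is not a letter
console.log(lettersAndMarks.test(decomposed)); // true
```

Normalizing input to NFC before matching reduces the problem for letters that have a precomposed form, but not every base and mark combination has one, so allowing \p{Mc} and \p{Mn} is still needed.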

It is not feasible to list all of them, but here is a selection of some common Latin script letters with diacritical marks. Keep in mind that this list is not exhaustive:

  1. Acute accent (´): á, é, í, ó, ú, ý, Á, É, Í, Ó, Ú, Ý
  2. Grave accent (`): à, è, ì, ò, ù, À, È, Ì, Ò, Ù
  3. Circumflex (^): â, ê, î, ô, û, Â, Ê, Î, Ô, Û
  4. Tilde (~): ã, ñ, õ, Ã, Ñ, Õ
  5. Diaeresis/umlaut (¨): ä, ë, ï, ö, ü, ÿ, Ä, Ë, Ï, Ö, Ü, Ÿ
  6. Ring above (˚): å, Å
  7. Cedilla (¸): ç, Ç
  8. Ogonek (˛): ą, ę, Ą, Ę
  9. Macron (¯): ā, ē, ī, ō, ū, Ā, Ē, Ī, Ō, Ū
  10. Caron/háček (ˇ): č, ď, ě, ǧ, ǩ, ň, ř, š, ť, ž, Č, Ď, Ě, Ǧ, Ǩ, Ň, Ř, Š, Ť, Ž
  11. Breve (˘): ă, ĕ, ğ, ĭ, ŏ, ŭ, Ă, Ĕ, Ğ, Ĭ, Ŏ, Ŭ
  12. Dot above (˙): ż, Ż
  13. Double acute (˝): ő, ű, Ő, Ű
  14. Stroke (solidus): đ, ł, ŧ, Đ, Ł, Ŧ

These diacritical marks are applied to the Latin script used in various languages, such as French, Spanish, Portuguese, German, Polish, Romanian, and many others. Each language uses diacritical marks in unique ways to indicate changes in pronunciation, stress, or grammatical functions.

Characters that match \p{Mc} (spacing combining marks) and \p{Mn} (non-spacing marks) are often used in languages that have diacritical marks or other modifiers for base letters. These marks can indicate different phonetic or tonal qualities.

And it's not limited to Latin-script languages; combining marks are widespread all around the world. Here are some examples of words and languages that use such characters:

  1. Devanagari (used in Hindi, Sanskrit, and other Indian languages):

    • Word: काम (kām) - In this Hindi word, the ā (आ) vowel is represented by a spacing combining mark called a "matra" (ा) that is added to the base consonant क (ka).
  2. Vietnamese:

    • Word: hẹn - The 'ẹ' vowel in this word can be formed by adding the non-spacing mark "dot below" (̣) to the base letter 'e'.
  3. Arabic:

    • Word: مُدَرِّس (mudarris) - This word has several non-spacing marks: the "damma" (ُ) above the م (m), the "fatha" (َ) above the د (d), and on the ر (r) both the "shadda" (ّ), which indicates a doubled consonant, and the "kasra" (ِ).
  4. Hebrew:

    • Word: שָׁלוֹם (shalom) - This word has a non-spacing mark called "kamatz" (ָ) under the ש (sh), which also carries the "shin dot" (ׁ), and another non-spacing mark called "holam" (ֹ) above the ו (v).

These are just a few examples of languages that use characters matching \p{Mc} and \p{Mn} to modify their base letters. There are many other languages and scripts that utilize combining marks to represent various phonetic and tonal features.
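
The four example words can be checked with a short JavaScript sketch (assuming Unicode property escapes are available). The words are written with \u escapes so the exact code points are visible, and the Vietnamese word is deliberately spelled in its decomposed form. The names exampleWords and countMarks are made up for this illustration.

```javascript
const exampleWords = {
  hindi: "\u0915\u093E\u092E",                              // काम
  vietnamese: "he\u0323n",                                  // hẹn as e + combining dot below
  arabic: "\u0645\u064F\u062F\u064E\u0631\u0651\u0650\u0633", // مُدَرِّس
  hebrew: "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD",     // שָׁלוֹם
};

const lettersOnly = /^\p{L}+$/u;
const lettersAndMarks = /^[\p{L}\p{Mc}\p{Mn}]+$/u;

// Count spacing (Mc) and non-spacing (Mn) marks, one code point at a time.
function countMarks(word) {
  let spacing = 0;
  let nonSpacing = 0;
  for (const ch of word) {
    if (/\p{Mc}/u.test(ch)) spacing++;
    else if (/\p{Mn}/u.test(ch)) nonSpacing++;
  }
  return { spacing, nonSpacing };
}

for (const [language, word] of Object.entries(exampleWords)) {
  // Prints the language, whether each pattern accepts the word, and the mark counts.
  console.log(language, lettersOnly.test(word), lettersAndMarks.test(word), countMarks(word));
}
```

Every one of these words is rejected by the letters-only pattern and accepted once \p{Mc} and \p{Mn} are allowed.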
