Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Regular expression to confirm whether a string is a valid Python identifier?

I have the following definition for an Identifier:

Identifier --> letter{ letter| digit}

Basically I have an identifier function that gets a string from a file and tests it to make sure that it's a valid identifier as defined above.

I've tried this:

if re.match('\w+(\w\d)?', i):
  return True
else:
  return False

but when I run my program every time it meets an integer it thinks that it's a valid identifier.

For example

c = 0 ;

it prints c as a valid identifier, which is fine, but it also prints 0 as a valid identifier.

What am I doing wrong here?



1 Answer


This question was asked 10 years ago, when Python 2 was still dominant. As many comments over the last decade have shown, my answer needed a serious update, starting with a big heads-up:

No single regex will properly match all (and only) valid Python identifiers. It didn't for Python 2, it doesn't for Python 3.

The reasons are:

  • As @JoeCondron pointed out, Python reserved keywords such as True, if, return, are not valid identifiers, and regexes alone are unable to handle this, so additional filtering is required.

  • Python 3 allows non-ASCII letters and numbers in an identifier, but the Unicode categories of letters and numbers that the lexical parser accepts in a valid identifier do not match the categories covered by \d, \w, \W in the re module, as demonstrated in @martineau's counter-example and explained in great detail by @Hatshepsut's amazing research.
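
Both gaps are easy to reproduce. The snippet below is only a minimal sketch, assuming Python 3; the sample strings "if" and "x²" are chosen here for illustration and are not the counter-examples from the comments, and the pattern is an identifier-style regex in the spirit of the 2012 answer.

```python
import keyword
import re

# An identifier-style regex: one non-digit word character, then word characters.
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

# Gap 1: a keyword looks like an identifier to the regex.
regex_accepts_if = identifier.match("if") is not None
if_is_keyword = keyword.iskeyword("if")

# Gap 2: \w accepts SUPERSCRIPT TWO (U+00B2), but the Python lexer does not.
regex_accepts_sup = identifier.match("x\u00b2") is not None
lexer_accepts_sup = "x\u00b2".isidentifier()

print(regex_accepts_if, if_is_keyword)       # True True
print(regex_accepts_sup, lexer_accepts_sup)  # True False
```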

While we could try to solve the first issue using keyword.iskeyword(), as @Alexander Huszagh suggested, and work around the other by limiting ourselves to ASCII-only identifiers, why bother using a regex at all?

As Hatshepsut said:

str.isidentifier() works

Just use it, problem solved.
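
A minimal sketch of what that could look like, assuming Python 3. The helper name is_valid_identifier is hypothetical, and the keyword check is added on top because str.isidentifier() on its own returns True for reserved words such as if:

```python
import keyword


def is_valid_identifier(name: str) -> bool:
    """Return True if `name` could be used as a variable name (hypothetical helper)."""
    # isidentifier() applies the lexer's Unicode rules;
    # iskeyword() filters out reserved words such as "if" and "True".
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["c", "0", "_a1", "1a", "aa bb", "if", "True"]:
    print("%-9r= %s" % (candidate, is_valid_identifier(candidate)))
```

With the input from the question, c is accepted and 0 is rejected. Soft keywords such as match are still accepted, since they remain legal names.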


As requested by the question, my original 2012 answer presents a regular expression based on Python 2's official definition of an identifier:

identifier ::=  (letter|"_") (letter | digit | "_")*

Which can be expressed by the regular expression:

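^[^\d\W]\w*\Z

Example:

import re

# \Z anchors at the very end of the string, so a trailing newline is rejected.
identifier = re.compile(r"^[^\d\W]\w*\Z", re.UNICODE)

tests = ["a", "a1", "_a1", "1a", "aa$%@%", "aa bb", "aa_bb", "aa\n"]
for test in tests:
    result = identifier.match(test)
    # %-9r pads the repr so the results line up
    print("%-9r= %s" % (test, result is not None))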

Result:

'a'      = True
'a1'     = True
'_a1'    = True
'1a'     = False
'aa$%@%' = False
'aa bb'  = False
'aa_bb'  = True
'aa\n'   = False
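
If a regex is still required, the workaround mentioned earlier (ASCII-only identifiers plus keyword.iskeyword()) could be sketched roughly as follows. This assumes Python 3; the names ASCII_IDENTIFIER and is_ascii_identifier are hypothetical, and the function deliberately rejects non-ASCII identifiers that Python 3 itself would accept.

```python
import keyword
import re

# re.ASCII restricts \w, \d and \W to the ASCII range,
# so the pattern means: [A-Za-z_] followed by [A-Za-z0-9_]*.
ASCII_IDENTIFIER = re.compile(r"[^\d\W]\w*", re.ASCII)


def is_ascii_identifier(name: str) -> bool:
    """ASCII-only identifier check (hypothetical helper)."""
    # fullmatch() must consume the whole string, so no explicit anchors are needed.
    if ASCII_IDENTIFIER.fullmatch(name) is None:
        return False
    # Reserved words match the pattern but are not usable as identifiers.
    return not keyword.iskeyword(name)
```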
