Hacker News
RegExr 2.0 (regexr.com)
106 points by _kushagra on April 5, 2014 | 43 comments



Regex testing is cool, but there are dozens of these kinds of tools and I'd really love to see some other kinds of regex tools

- A list generator. Enter a regex, set repetition operator constraints (e.g. *->{0,3}, +->{1,3}, .->[A-Z0-9 ], etc.) and have it exhaustively generate a list of matching strings. This is helpful when you have a regex that matches your test strings, but you also want to know what else it'll match. The constraints are to keep it from generating infinite lists. Even if it jams out tens or hundreds of thousands of produced strings, it's still useful. I've found that most people just build up the first regex that will "match" their input text, and move on without thinking about all the edge cases they've just introduced.

- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.

- A regex list generator. Give it a list of strings you want to match and have it generate a regex. A sliding "fuzziness" control could tell it to take alternates in the same character position and substitute one of:

1. Just the characters in the given list - a, t and q in the same position generates a|t|q

2. A representative narrow character range - if I give it a|t|q it knows to use [A-Z] while a|t|q|4 might generate [A-Z0-9]

3. A larger character range, a|t|q might just go ahead and produce [A-Z0-9]

4. An even larger character range, whatever it is, just use .

And maybe another slider for repetitions, so if I end up with [A-Z][A-Z][A-Z], should it just produce [A-Z]{3} or can I go ahead and have it [A-Z]+

Jam the result through an optimizer (see previous idea above) to clean up the regex and maybe even run it through the list generator to check if it produces only what you want.
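
A rough sketch of the list generator idea, in Python. This is a brute-force stand-in, not a real implementation: instead of rewriting the repetition operators, it constrains the alphabet and the maximum length (both hypothetical parameters) and keeps every candidate the regex fully matches.

```python
import itertools
import re


def matching_strings(pattern, alphabet, max_len):
    """Yield every string over `alphabet`, up to `max_len` characters,
    that the pattern matches in full."""
    regex = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            candidate = "".join(chars)
            if regex.fullmatch(candidate):
                yield candidate


# Someone wrote a+b? to match "ab"; see what else it accepts.
print(list(matching_strings(r"a+b?", "ab", 3)))
```

The cost grows exponentially with max_len, so this only works for small alphabets and short strings. A serious tool would walk the regex's parse tree instead.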


>- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.

That should be unnecessary if your regex engine does the DFA transformation. Basically, it converts the regexp into a state machine and then combines all of the branches in the state machine to generate synthetic states that can represent the "superposition" of matching multiple branches. This means your regex (once compiled) will run in bounded memory and in time at most proportional to the input length (iirc).
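
To make the "superposition" idea concrete, here is a minimal sketch of the subset construction in Python. It only handles an alternation of literal words, and all function names are made up for illustration. It does not show how any particular engine does it.

```python
from collections import deque


def nfa_for_words(words):
    """Build an NFA with one separate branch per literal word.
    State 0 is the shared start state."""
    transitions = {}  # (state, symbol) -> set of next states
    accepting = set()
    next_state = 1
    for word in words:
        state = 0
        for symbol in word:
            transitions.setdefault((state, symbol), set()).add(next_state)
            state = next_state
            next_state += 1
        accepting.add(state)
    return transitions, accepting


def determinize(transitions, accepting):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start = frozenset({0})
    dfa = {}  # (dfa_state, symbol) -> dfa_state
    dfa_accepting = set()
    queue = deque([start])
    seen = {start}
    while queue:
        current = queue.popleft()
        if current & accepting:
            dfa_accepting.add(current)
        symbols = {sym for (state, sym) in transitions if state in current}
        for sym in symbols:
            target = frozenset(
                nxt
                for state in current
                for nxt in transitions.get((state, sym), ())
            )
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, dfa, dfa_accepting


def dfa_matches(start, dfa, dfa_accepting, text):
    """Run the DFA: one table lookup per input character, no backtracking."""
    state = start
    for symbol in text:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in dfa_accepting


start, dfa, dfa_accepting = determinize(*nfa_for_words(["abc", "abd", "ax"]))
print(len(dfa[(start, "a")]))  # the three "a" branches share one DFA state
print(dfa_matches(start, dfa, dfa_accepting, "abd"))
```

After reading "a", the machine sits in a single synthetic state that stands for all three branches at once, which is why or'ing the patterns together costs nothing extra at match time.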


I actually do the combining idea all the time. As long as the language is roughly PCRE compatible, you can use the merger linked below to spit out your regex and, if necessary for your target language, tweak it a bit so it fits.

I've generated some very massive regexes that are quite speedy.

Merger

  https://metacpan.org/pod/Regexp::Assemble
These are also super handy

  https://metacpan.org/pod/Number::Range::Regex
  https://metacpan.org/pod/Regexp::Common
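
For anyone who wants to see the trie approach without Perl, here is a small Python sketch that merges literal strings by shared prefix. It is only an illustration of the general idea and does not describe how Regexp::Assemble works internally. The helper names are invented.

```python
import re


def build_trie(words):
    """Nested dicts keyed by character; the key "" marks end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie


def trie_to_regex(node):
    """Return a regex matching every suffix stored below this node."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word also ends at this node, everything below is optional.
    return body + "?" if ends_here else body


pattern = trie_to_regex(build_trie(["foo", "foobar", "fob"]))
print(pattern)
```

The shared prefix "fo" is emitted once, so a backtracking engine does not re-scan it for each alternative the way it would with foo|foobar|fob.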


Yeah, Regexp::Assemble was what I had in mind. There are a few tools that try to generate a list of matching strings from the expression, but I've never been satisfied with their output. Either they're slow or they don't let you constrain the regex, and none of them generate comprehensive lists for some reason.


> I'd really love to see some other kinds of regex tools

I'd really love to see a better regex syntax. The current one is obviously deficient beyond repair. The tools cannot address the root of the problem.



Why don't you take a crack at it?


People just cannot do Unicode even remotely properly. Just cannot.

𝄞 is one char, not two. привет is matched by \w+.

PS there's some advanced stuff, but where are the basic [[:posix:]] char classes?
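
For comparison, a short Python check of both complaints. Python's re module stands in here for a Unicode-aware engine, and re.ASCII is used as a rough approximation of an ASCII-only \w. Neither is a claim about how RegExr itself behaves.

```python
import re

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

code_points = len(clef)                          # what "one char" means
utf16_units = len(clef.encode("utf-16-le")) // 2  # what a UTF-16 engine counts

dot_matches_clef = bool(re.fullmatch(r".", clef))
word_matches_cyrillic = bool(re.fullmatch(r"\w+", "привет"))
ascii_word_matches_cyrillic = bool(re.fullmatch(r"\w+", "привет", re.ASCII))

print(code_points, utf16_units)
print(dot_matches_clef, word_matches_cyrillic, ascii_word_matches_cyrillic)
```

The clef is one code point but two UTF-16 code units, which is where the "two chars" behaviour comes from in engines that index by code unit.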


Just to make it clear: It does not even support the basic Latin-1 charset correctly. Matching my family-name requires manual intervention. This is sad.

It seems a very nice regex page otherwise.


Creator here - can you elaborate? What is your family name? The example in this thread ("Grüneis") matches and displays correctly in all the browsers I've tested.

Are you perhaps trying to use a RegEx feature that is not supported by JS? Currently, RegExr only supports the JS flavour of RegEx.


Forget it, I was not used to JavaScript RegEx. I just looked it up on MDN, and it really defines `\w` to be very limited. Doesn't really make it any better, but whatever.


Family name = Grüneis


It doesn't support \p{} either for matching Unicode classes, e.g. \p{Lu} matches uppercase letters (so Æ and Ö count too).
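
If your engine lacks \p{}, the same test can be approximated by looking up the Unicode general category directly. A minimal Python sketch using the standard unicodedata module (Python's built-in re has no \p{} either; the helper names are made up):

```python
import unicodedata


def is_uppercase_letter(ch):
    """True when the character's general category is Lu, like \\p{Lu}."""
    return unicodedata.category(ch) == "Lu"


def uppercase_letters(text):
    """Return the characters of `text` that \\p{Lu} would match."""
    return [ch for ch in text if is_uppercase_letter(ch)]


print(uppercase_letters("Æther Öl abc Z"))
```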


I couldn't find a way to add the /u or /s flag. Only /i, /g and /m are allowed :(


Creator here - we are currently relying on the JS RegExp API, and thus only support features of that engine, which are somewhat limited. In the future, we may support other flavours. We may also add specific errors for more common features that are not supported, as I've already done for lookbehinds.


> Uh-oh, it looks like your browser is not supported.

> RegExr only supports modern desktop browsers.

I'm using Firefox 30 on Ubuntu. I think it's plenty modern :)


I get the same message with Chrome 34 on Android 4.4.2


Pretty sure Android is not commonly considered a desktop system ;) Though mobile (or at least tablet) support would be cool


same here :(


I got the same message (FF28 Mac). Then I turned session cookies on and it worked.

Obviously they need to fix the error message...


No problem here with Fedora 20 and Firefox 31.0a1 (2014-04-04)


Very nicely done. As someone else pointed out there are quite a few of these tools, but I think you've done a really nice job with this one. One suggestion: make the reference easier to scan at a top level as opposed to drilling down.


I'm guessing the following is either near-impossible or pure-impossible, but:

Is there a tool that allows you to highlight portions of a string and generate a corresponding regex? (i.e. the inverse of RegExr)


Here is the problem with that:

Consider the string abcdefgh

Guess what!? I have the perfect regex to match your string.

  "abcdefgh"

So given a string literal, there is always a regex to match that literal. Namely, the literal itself.

Really, what you want is a tool that, given several examples, will generate a regex that matches all of them.

So you'd give it:

  aaaaabaa
  aabaaa
  aba
  abaaaaa
And it'd generate "a+ba+"

The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus!

  "[T0 T1 T2 ... ]*"
or even

  ".*"
So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.

So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.

Source: Working on this exact problem for graduate research


T0 | T1 | T2 | ... would match exactly the correct thing with all positive examples, and (T0 | T1 | T2) & !(CE1 | CE2 | CE3) would match exactly the correct thing with positive and negative examples.

But that's pretty stupid, because you don't generalize beyond your examples.

What's your approach?

edit: removed random conjecture


You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is the smallest regex.

So in your example, given the training examples:

  aba
  abaa
  aaaaba
and the counter examples:

  abba
  ba
  ab
It's clear to a human I probably want to match "a+ba+". That's clearly much smaller than ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), so it would be a "better" regex.
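
A toy version of the "smallest regex" heuristic in Python. The candidate list is hand-written and purely hypothetical. Generating good candidates is the actual hard part and is not attempted here.

```python
import re


def consistent(pattern, positives, negatives):
    """True if the pattern fully matches every positive and no negative."""
    regex = re.compile(pattern)
    return all(regex.fullmatch(s) for s in positives) and not any(
        regex.fullmatch(s) for s in negatives
    )


def smallest_consistent(candidates, positives, negatives):
    """Pick the shortest candidate that agrees with all the examples."""
    ok = [p for p in candidates if consistent(p, positives, negatives)]
    return min(ok, key=len) if ok else None


positives = ["aba", "abaa", "aaaaba"]
negatives = ["abba", "ba", "ab"]
candidates = [".*", "a+ba+", "aba|abaa|aaaaba", "a*ba*", "[ab]+"]

print(smallest_consistent(candidates, positives, negatives))
```

Here ".*", "a*ba*" and "[ab]+" are rejected by the counterexamples, and length breaks the tie between the enumeration and a+ba+.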


Sounds like you want to be able to specify some kind of pattern to define accepted and rejected matches. A regex would be ideal for this. oh wait....


Since you're a researcher I must be missing something. But since regexps are closed under union, what is the problem with taking the union of all of them? I'm imagining that it would be conceptually simple to hook up all of the non deterministic state machines such that you get a non deterministic state machine which is the union of all of them. Then convert it to a deterministic state machine. You might get state explosion, but at least you would have found some machine to recognize the language. Is state minimization simple (complexity wise)? Is it even possible to find a decently small DSM in the general case (not necessarily the most minimal machine)?


My reply to nmrm might answer your question.

Finding some regular expression that matches all of the positive examples and does not match all of the negative examples is trivial. Finding a good regular expression that does that is not.

State minimization does not mitigate this problem. As an aside, minimizing the states of a DFA takes polynomial time.

Given the positive examples:

  aba
  abaa
  aaaaba
and the negative examples:

  abba
  ba
  ab
we could make a regex that does something like ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), but unfortunately, running a state minimization algorithm on this regex does not give you "a+ba+" because the two regexes are not equivalent (they do not accept the same language).

So you can find plenty of regex that will match your examples and not match your counterexamples, but you cannot easily minimize them to what you do want.
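
The non-equivalence is easy to check. Python's re has no & or ! operators, so this sketch emulates the complement with a negative lookahead, which is an assumption of the example rather than part of the argument.

```python
import re

# ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), via a lookahead.
enumerated = re.compile(r"(?!(?:abba|ba|ab)\Z)(?:aba|abaa|aaaaba)")
generalized = re.compile(r"a+ba+")

positives = ["aba", "abaa", "aaaaba"]
negatives = ["abba", "ba", "ab"]
unseen = "aaba"  # not in either example list

for s in positives + negatives + [unseen]:
    print(s, bool(enumerated.fullmatch(s)), bool(generalized.fullmatch(s)))
```

Both patterns agree on every training example and disagree on "aaba", so they accept different languages and no minimization step can turn one into the other.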


"aaa" is a valid regex that matches the string "aaa". If you have special characters in your source string, many libraries have a regex for escaping them. So, generating a regex to match your exact string is trivial. Even matching a group of strings is trivial via (aaa|bbb|etc), though it gets long.

Given that, what I think you're really asking is, "how do I automatically generate a regex of optimal conciseness given a set of inputs I'd like to match, and maybe a bunch of other inputs I want to avoid matching?"

This looks like it iteratively does what you want: http://regex.inginf.units.it/ (Note that when I went there, it said "6 slots available", presumably because everything runs server-side. If a bunch of people pile in there, you probably won't actually be able to test it due to limited resources on their part.)
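
For reference, the trivial baseline from the start of this comment looks like this in Python, where the escaping helper is re.escape. The function name and the longest-first ordering are my own choices.

```python
import re


def literal_alternation(strings):
    """Build (?:s1|s2|...) that matches exactly the given strings."""
    # Longest first, so a short string never wins over a longer one
    # that shares its prefix when the pattern is used with search().
    ordered = sorted(strings, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"


pattern = literal_alternation(["a.b", "a+b", "etc"])
print(pattern)
print(bool(re.fullmatch(pattern, "a.b")), bool(re.fullmatch(pattern, "axb")))
```

It matches the inputs and nothing else, which is exactly why it is not the interesting answer: it never generalizes.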


this is very cool ty for the post.


Probably not what you're looking for, but check out http://nbviewer.ipython.org/url/norvig.com/ipython/xkcd1313....


Regexp::Assemble kind of gets you there, you can feed it strings and it'll spit out a regex.


Reminds me of http://rubular.com/, except it isn't Ruby-focused and is more community-based. Seems pretty cool.


Or http://www.regexplanet.com/, but regex planet supports a lot more flavors.


Or http://re-try.appspot.com/ for a Python equivalent.


There is one for Javascript that I use pretty often: http://scriptular.com/ (based on the Ruby one: http://www.rubular.com/)


https://www.debuggex.com/ is a nice alternative that is a little different from other regex sites I've seen.


Why can I not match against "\w*" for instance? It just says "infinite" and does not seem to attempt to match.


Creator here - this is because \w* matches 0 characters, and thus matches infinitely. You can roll over the "infinite" error for details, or look in the help.

Try \w+ instead.


But \w* matches "" and "abc" but not "!a". How can I test this with your tool if \w* always says "infinite"?
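
A sketch of what is going on, in Python. This is a generic scanning loop and not RegExr's actual code. A pattern that can match zero characters succeeds at every position without consuming input, so a loop that does not step past empty matches never terminates.

```python
import re


def scan(pattern, text):
    """Collect successive (position, match) pairs, stepping past
    zero-length matches so the loop always makes progress."""
    regex = re.compile(pattern)
    matches, pos = [], 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        matches.append((m.start(), m.group()))
        # An empty match leaves the position unchanged. Without this
        # bump the loop would find the same empty match forever.
        pos = m.end() + 1 if m.end() == m.start() else m.end()
    return matches


print(scan(r"\w*", "!a"))
print(scan(r"\w+", "!a"))

# Whole-string tests, as described in the question above.
print([bool(re.fullmatch(r"\w*", s)) for s in ["", "abc", "!a"]])
```

With the bump in place \w* is testable: it yields an empty match before "!", then "a", then an empty match at the end of the input.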


one of the best regular expression testers online just got better. Great site, love it


A side note: I found the Patterns app on OS X very useful for regex.



