
Escape Meta Chars in Character Classes

Andreas Säger
Hi Fred,

Another regexp question:

In HSQL REGEXP_SUBSTRING ...

'\s+(\d{1,2})(,(\d{1,2}))*([.:;-](\d{1,2}))*$'

... matches some whitespace followed by 1, 2 or 3 numbers of 1 or 2
digits each at the end of the string.
The first number is always separated from the second by one comma.
Between the 2nd and the 3rd number there may be any one of . : ; - but
at this point I'd like to add some additional characters, say + and /
> http://docs.oracle.com/javase/tutorial/essential/regex/literals.html

tells us that

<([{\^-=$!|]})?*+.>

are metacharacters and that we can escape them with a preceding backslash or
by enclosing them in \Q and \E
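
For readers who want to try the two escape methods outside the database, here is a minimal Java sketch. The class name EscapeDemo and the helper fullMatch are made up for this illustration; Pattern.compile, Matcher.matches and Pattern.quote are standard java.util.regex calls. Note that backslashes are doubled in Java source, while the SQL literals in this thread use single backslashes.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of HSQLDB.
final class EscapeDemo {
    // Returns true when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unescaped: '+' is a quantifier, so this means "one or more 1, then 2".
        System.out.println(fullMatch("1+2", "1+2"));                // false
        // Backslash escape makes '+' a literal plus sign.
        System.out.println(fullMatch("1\\+2", "1+2"));              // true
        // \Q...\E quotes everything in between.
        System.out.println(fullMatch("\\Q1+2\\E", "1+2"));          // true
        // Pattern.quote produces a quoted form of an arbitrary string.
        System.out.println(fullMatch(Pattern.quote("1+2"), "1+2")); // true
    }
}
```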

Adding a slash to the character class:

'\s+(\d{1,2})(,(\d{1,2}))*([.:;-/](\d{1,2}))*$'

> java.util.regex.PatternSyntaxException: Illegal character range near index 31

Same with an escaped slash \/ and with \Q/\E too although the slash is
not even in the above list of meta characters.

And as far as I remember from other languages, we should be able to use
unescaped characters within a character class.

Oh, while playing around and writing this mail, I found out that the
position within the character class makes a difference.
Character class
[/+.:;-] works well with unescaped meta characters
[.:;-/+] raises an error
The critical characters must not appear at the end of the class.
Did I find a minor bug, or is this behaviour specified anywhere?
Thanks for reading. I can live with that.
Andreas


------------------------------------------------------------------------------
_______________________________________________
Hsqldb-user mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/hsqldb-user

Re: Escape Meta Chars in Character Classes

Fred Toussi-2
Hi Andreas

I do not write regular expressions often enough to comment. When
you ask a question like this, please provide an example of the input and the
required output so that others can try it and comment.
You may get better answers if you ask your question on Stack Overflow using
the HSQLDB, Java and regular expression tags.

Fred

On Wed, Dec 16, 2015, at 13:01, Andreas Säger wrote:

> [...]

Re: Escape Meta Chars in Character Classes

Andreas Säger
A simplified, working example:

CREATE MEMORY TABLE T ("PAIR" CHAR(3));
INSERT INTO T VALUES('1,2');
INSERT INTO T VALUES('1\2');
INSERT INTO T VALUES('1+2');
INSERT INTO T VALUES('1/2');
INSERT INTO T VALUES('1.2');
INSERT INTO T VALUES('1:2');
INSERT INTO T VALUES('1;2');
INSERT INTO T VALUES('1-2');
INSERT INTO T VALUES('1&2');


SELECT REGEXP_SUBSTRING("PAIR",'\d[-+/&.:;]\d') FROM "T";

The regexp matches all records except the first and the second because
comma and backslash are not in the character class. It matches one
digit, followed by any character within the character class [-+/&.:;], followed
by another digit. It works as expected without escaping special
characters because, inside a character class, most metacharacters are taken literally.
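
The same check can be run directly against java.util.regex, which is a quick way to try a pattern without a database. This is only a sketch: the class name PairClassDemo and the helper matching are invented for the illustration, and the list of inputs simply repeats the rows inserted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of HSQLDB.
final class PairClassDemo {
    // Same pattern as in the SELECT above; backslashes are doubled for Java source.
    static final Pattern PAIR = Pattern.compile("\\d[-+/&.:;]\\d");

    // Returns the matched substring for every input in which the pattern is found.
    static List<String> matching(List<String> inputs) {
        List<String> hits = new ArrayList<>();
        for (String s : inputs) {
            Matcher m = PAIR.matcher(s);
            if (m.find()) {
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // "1\\2" in Java source is the three-character string 1\2.
        List<String> rows = Arrays.asList(
            "1,2", "1\\2", "1+2", "1/2", "1.2", "1:2", "1;2", "1-2", "1&2");
        System.out.println(matching(rows)); // [1+2, 1/2, 1.2, 1:2, 1;2, 1-2, 1&2]
    }
}
```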

But when you change the order of characters within the class, you get an
error

> java.util.regex.PatternSyntaxException: Illegal character range near index 10

when the last element of the character class is a metacharacter
according to
> http://docs.oracle.com/javase/tutorial/essential/regex/literals.html

> The metacharacters supported by this API are: <([{\^-=$!|]})?*+.>

or when the last character is a slash. Maybe there are even more "bad"
characters.

The following variants raised the PatternSyntaxException for me:
-- + at the end of class:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-/&.:;+]\d') FROM "T";

-- . at the end of class:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-/&:;+.]\d') FROM "T";

-- / at the end of class which is not even a metacharacter:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-&:;+./]\d') FROM "T";

In other words: certain characters need to be escaped when they are the
last element of a character class, or you put a "good" character at the end.
Same as above, but with the last character escaped by \
-- \+ at the end of class:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-/&.:;\+]\d') FROM "T";

-- \. at the end of class:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-/&:;+\.]\d') FROM "T";

-- \/ at the end of class which is not even a metacharacter:
SELECT REGEXP_SUBSTRING("PAIR",'\d[-&:;+.\/]\d') FROM "T";

A backslash in a character class needs to be escaped by a leading
backslash in any case, like the leading \\ in the following (with the dash
moved to the end so that \\-& is not read as a range):
SELECT REGEXP_SUBSTRING("PAIR",'\d[\\&:;+.\/-]\d') FROM "T";


In my previous posting I wrote ...
> Same problem with an escaped slash \/ and with \Q/\E too ...


... which I cannot reproduce anymore. Both escaping methods fix the
problem of a "bad" character at the end of a character class.
After all, I think all this is somewhat "normal", but what is "normal" in a
regexp context?




Re: Escape Meta Chars in Character Classes

Kim Rydhof Thor Hansen

Just a guess, but it might be the - that has to be at the beginning or end of the character class if it is not meant to be a range.
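
A small Java sketch to check this guess against java.util.regex. The class name DashPositionDemo and the helper compiles are made up for the illustration. It assumes nothing beyond Pattern.compile throwing PatternSyntaxException for an invalid pattern.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, not part of HSQLDB.
final class DashPositionDemo {
    // Returns true when the regex compiles, false on a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Dash at the end or at the beginning: taken as a literal dash.
        System.out.println(compiles("[/+.:;-]"));    // true
        System.out.println(compiles("[-/+.:;]"));    // true
        // Dash in the middle: ";-/" is read as a range from ';' (0x3B)
        // down to '/' (0x2F), which is reversed and therefore illegal.
        System.out.println(compiles("[.:;-/+]"));    // false
        // Escaping the dash makes it literal in any position.
        System.out.println(compiles("[.:;\\-/+]"));  // true
    }
}
```

So in the failing classes it is the position of the dash, not the character at the end of the class, that decides whether the pattern compiles.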

Kim


On Wed, 16 Dec 2015 16:52 Andreas Säger <[hidden email]> wrote:
> [...]

Re: Escape Meta Chars in Character Classes

Andreas Säger
On 20.12.2015 at 19:38, Kim Rydhof Thor Hansen wrote:
> Just a guess, but it might be the - that has to be at the beginning or
> end of the character class if it is not meant to be a range.
>
> Kim

Yes, you are right, [&-/] matches everything between & and /, which
includes the ASCII chars &'()*+,-./
So we need to take care of the from-to meaning of the dash: it is literal
only at the beginning or end of a character class, or when it is escaped.
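
To make the range visible, here is a short Java sketch that lists what a character class actually matches, and then tries the pattern from my first posting with + and / added and the dash moved to the end. The names RangeDemo, members and TAIL are made up for this illustration, and the test strings are invented sample input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of HSQLDB.
final class RangeDemo {
    // Lists every printable ASCII character that the given character class matches.
    static String members(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = ' '; c <= '~'; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Pattern from the first posting with '+' and '/' added; the dash is
    // last in the class, so it is a literal dash and not a range operator.
    static final Pattern TAIL =
        Pattern.compile("\\s+(\\d{1,2})(,(\\d{1,2}))*([.:;+/-](\\d{1,2}))*$");

    public static void main(String[] args) {
        System.out.println(members("[&-/]")); // &'()*+,-./
        System.out.println(members("[&/-]")); // &-/
        System.out.println(TAIL.matcher("foo 1,2/3").find()); // true
        System.out.println(TAIL.matcher("foo 1,2#3").find()); // false
    }
}
```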


Thanks for the heads up,
Andreas

