Author: | Shunsuke Sogame and Christopher Diggins(original author) |
---|---|
Contact: | pstade.mb@gmail.com |
License: | Distributed under the Boost Software License Version 1.0 |
Version: | 1.02.9 |
"Give him some of the biscuit which famous Rhodes has sent you."
—Marcus Valerius Martialis
I was looking for a small and unstrict xml parser. Boost.Spirit and Boost.Xpressive showed a lot of functionality but made big executables. On the other hand, YARD written by Christopher Diggins was simple, small and fast. In time, I found that YARD and Boost.Range could be binded. It was named Biscuit.
Biscuit is an object-oriented recursive-descent parser generator framework implemented using class templates. The templates allow us to author Extended Backus-Normal Form (EBNF) in C++. Additional information is available at YARD.
A simple EBNF grammar snippet:
group ::= '(' expression ')' factor ::= integer | group term ::= factor (('*' factor) | ('/' factor))* expression ::= term (('+' term) | ('-' term))*
is approximated using Biscuit's facilities as seen in this code snippet:
struct expression ; struct group : seq< char_<'('>, expression, char_<')'> > { }; struct factor : or_< integer, group > { }; struct term : seq< factor, star< or_< seq< char_<'*'>, factor >, seq< char_<'/'>, factor > > > > { }; struct expression : seq< term, star< or_< seq< char_<'+'>, term >, seq< char_<'-'>, term > > > > { };
Through the magic of the lazy template instantiation, these are perfectly valid types. The production rule expression is a type that has a static member function parse. As parse will be instantiated later by Algorithms, all you have to do is to declare a type.
Direct recurring types [1] also are valid:
struct S : or_< seq< chseq<'('>, S, chseq<')'> , S >, eps > { };
Note that left-recursions are not allowed, though compilers might detect them if templates are easy.
[1] | typedef is not available for such usage but can define a non-recurring parser inside a function unlike struct. |
Include Biscuit headers:
#include <pstade/biscuit.hpp> using namespace pstade; using namespace biscuit;
Define your own Parser type:
typedef seq< chseq<'/','*'>, star_until< any, chseq<'*','/'> > > c_comment;
Call Algorithms [2]:
std::string text("/* Hello, Biscuit! */"); if (biscuit::match<c_comment>(text)) { //... }
[2] | An unqualified call may trigger an unintentional ADL. You must always add biscuit::. |
A UserState is any type that is passed to Algorithms or Ranges as the second argument. Non-const rvalues are disallowed as UserState. If no object is passed to Algorithms, this type is null_state_type.
A Parser is any type that has the static member function:
template< class Unspecified, class UserState > static bool parse(Unspecified& , UserState& us);
A ParsingRange is a Forward Range that is passed to Algorithms or Ranges as the first argument and the value_type of the Forward Range must be Equality Comparable.
A ParsingSubRange [3] is a boost::iterator_range<boost::range_result_iterator<ParsingRange>::type>.
A SemanticAction is a Default Constructible Functor and the expression a(r, us) must be valid, where a is an object of type Semantic Action, r is an object of type Parsing SubRange and us is an object of type User State.
A ValueFunctor is a Default Constructible Functor and the expression v(us) must be valid, where v is an object of type Value Functor and us is an object of type User State.
[3] | Parsing SubRange is not defined as boost::sub_range for broken compilers, but you can catch it using boost::sub_range<ParsingRange>. Note that boost::sub_range is not always assignable under eVC4 and VC8 because of their bugs. |
Some Parser templates are predefined as a means for Parser composition and embedding.
The table below lists EBNF and their equivalents in Biscuit.
EBNF (or Perl) Biscuit Meaning . any [4] any object ^ begin beginning of parsing range $ end end of parsing range A | B or_<A, B> alternation of A and B A B seq<A, B> sequence of A and B A* star<A> zero or more times, greedy A+ plus<A> one or more times, greedy A? opt<A> zero or one time, greedy A & B and_<A, B> match A, and the matching-range matches B A - B minus<A, B> match A, but the matching-range doesn't match B (A) capture<1, A> capture a back-reference \1 backref<1> a previously captured back-reference A{n,m} repeat<A, n, m> between n and m times, greedy A*? B star_until<A, B> zero or more As, non-greedy, and B A*? (?=B) star_before<A, B> same as star_until< A, before<B> > (?=A) before<A> positive look-ahead assertion (?!A) not_< before<A> > negative look-ahead assertion a char_<'a'> a character b wchar<L'b'> a wide-character c bchar<long,0x63> a type-specified character Diggins chseq<'D','i','g','g','i','n','s'> a string MB wchseq<L'M',L'B'> a wide-string MB bchseq<long,0x4d,0x42> a type-specified string [0-9] chrng<'0','9'> characters in range '0' through '9' [abc] chset<'a','b','c'> characters 'a','b' or 'c' [0-9abc] or_< chrng<'0','9'>, chset<'a','b','c'> > characters 'a','b','c' or in range '0' though '9' [^abc] not_< chset<'a','b','c'> > [5] not characters 'a','b' or 'c' /Diggins/i ichseq<'D','i','g','g','i','n','s'> a case-insensitive string, locale-insensitive \w alnum a word \W not_<alnum> not a word \d digit a digit \D not_<digit> not a digit \s space a space \S not_<space> not a space $ eol a literal newline or end of parsing range ??? line character sequence before eol ??? identity<A> same as A, delay definition of A when deriving ??? eps match the empty range ??? nothing never match anything
YARD and Biscuit have no back-tracking on star operations. Instead star_until or star_before are available. The default template arity is twenty. If you want more arity, define PSTADE_BISCUIT_CFG_NO_PREPROCESSED_HEADERS and PSTADE_BISCUIT_LIMIT_PARSER_ARITY before Biscuit headers. VC++7.1 limits PSTADE_BISCUIT_LIMIT_PARSER_ARITY to 64. Note that the big arity tends to make internal compiler errors.
The C++ Standard doesn't allow you to pass a string literal to templates. chseq template parameter arity is limited. For that workaround, PSTADE_BISCUIT_SEQ_LITERAL macro is provided:
PSTADE_BISCUIT_SEQ_LITERAL(hello, "hello") PSTADE_BISCUIT_SEQ_LITERAL(bye, L"bye") void test_literal() { BOOST_CHECK( biscuit::match<hello>(std::string("hello")) ); BOOST_CHECK( biscuit::match<bye>(std::wstring(L"bye")) ); }
PSTADE_BISCUIT_SEQ_LITERAL defines a Parser whose name is the first argument by using the string literal that is passed as the second argument.
actor template creates a Parser that triggers a Semantic Action object:
struct decorate_action { void operator()(boost::sub_range<std::string> rng, std::stringstream& out) { out << "[" << boost::copy_range<std::string>(rng) << "]"; boost::to_upper(rng); } }; struct xml_comment : seq< chseq<'<','!','-','-'>, star< or_< minus< any, chseq<'-'> >, seq< chseq<'-'>, minus< any, chseq<'-'> > > > >, chseq<'-','-','>'> > { }; struct xml_comment_action : actor< xml_comment, decorate_action > { }; void test_actor() { std::string text("<!-- xml comment -->"); std::stringstream out; BOOST_CHECK( biscuit::match<xml_comment_action>(text, out) ); BOOST_CHECK( oven::equal(out.str(), "[<!-- xml comment -->]") ); BOOST_CHECK( oven::equal(text, "<!-- XML COMMENT -->") ); }
Parsing SubRange can be assigned to boost::sub_range<ParsingRange> idiomatically. If a Parsing Range is mutable, its Parsing SubRange also is mutable. Note that a copy of boost::sub_range is only copies of two iterators.
Directives are also Parsers which contain some ports of Boost.Spirit's Directives.
Boost.Spirit Biscuit Meaning lexeme_d[A] impossible turn off white space skipping as_lower_d[A] as_lower<A> parsing range is transformed to lower-case ??? as_filtered<A, Parser> parsing range is filtered using parser ??? as_transformed<A, UnaryFunction> [6] parsing range is transformed using functor ??? lazy_actions<A> suppress non-intended actions by parsing twice no_actions[A] no_actions<A> semantic actions not fire longest_d[A|B] longest<A, B> choose the longest match shortest_d[A|B] shortest<A, B> choose the shortest match limit_d[A] require<A, Predicate> [7] parsing range requires predicate is constrained
[4] | There is a debate over whether or not to define Parsers as class templates even if no parameters. If you want such parsers, define PSTADE_BISCUIT_CFG_NULLARY_CLASS_TEMPLATE_PARSER before Biscuit headers. |
[5] | not_ can be applied only to one character Parser with a few exceptions. |
[6] | UnaryFunction requirements are the same as boost::transform_iterator's. |
[7] | Predicate must conform to Semantic Action and the valid expression must be convertible to bool. |
Algorithms of Biscuit work on Forward Range. Note that Parsers don't know value_type of the range. For instance, a Parser chseq works properly if value_type of the range is comparable with char.
biscuit::match returns true if a Parser runs through the range; otherwise false:
BOOST_CHECK( biscuit::match<xml_comment>("<!-- hello, xml comment -->"|oven::as_c_str) ); BOOST_CHECK( !biscuit::match<xml_comment>("<!-- not well-formed comment -- -->"|oven::as_c_str) );
Notice that a null-terminated string is no longer a model of Range with Boost 1.35. oven::as_c_str is provided for the workaround.
biscuit::search returns the first occurence of the matching Parsing SubRange. If not found, it returns boost::make_iterator_range(boost::end(r), boost::end(r)), where r is an object of type Parsing Range:
std::string text(" int i; /* c comment */ int j; "); boost::sub_range<std::string> rng = biscuit::search<c_comment>(text); BOOST_CHECK( oven::equal(rng, "/* c comment */") );
While biscuit::match returns only whether or not to succeed, biscuit::parse returns Parsing SubRange that a Parser runs through. If a parsing fails, it returns boost::make_iterator_range(boost::begin(r), boost::begin(r)), where r is an object of type Parsing Range.
filter_range is a Forward Range that is filtered by Parser:
std::string text(" /* c comment no.1 */ int i; /* c comment no.2 */ i = 1; /* c comment no.3 */ ++i; "); filter_range<c_comment, std::string> comments(text); BOOST_CHECK(( biscuit::match< repeat<c_comment, 3> >(comments) ));
A chain of filter_range works properly:
BOOST_CHECK(( biscuit::match< chseq<'x','y','z'> >( biscuit::make_filter_range< not_< chset<'&','.','%'> > >( biscuit::make_filter_range< not_<space> >( biscuit::make_filter_range< not_<digit> >( oven::as_c_str("x & 4 y . 125 % z") ) ) ) ) ));
filter_range is a model of Forward Range that can be passed to Algorithms. That's why Biscuit doesn't provide anything like Boost.Spirit's Scanner.
Range adapter syntax also is supported by filtered:
BOOST_CHECK(( biscuit::match< repeat< char_<'D'>, 3 > >( "abcdabcdabcd" | oven::as_c_str | biscuit::filtered< not_< char_<'a'> > >() | biscuit::filtered< not_< char_<'b'> > >() | biscuit::filtered< not_< char_<'c'> > >() | oven::transformed(to_upper) ) ));
token_range is a Forward Range whose value_type is a matching Parsing SubRange:
std::string text(" /* c comment no.1 */int i; /* c comment no.2 */i = 1; /* c comment no.3 */ ++i; "); token_range<c_comment, std::string> comments(text); BOOST_FOREACH (boost::sub_range<std::string> rng, comments) { std::cout << boost::copy_range<std::string>(rng) << std::endl; } BOOST_FOREACH ( boost::iterator_range<const char *> rng, " /* c comment no.1 */int i; /* c comment no.2 */i = 1; /* c comment no.3 */ ++i; " | oven::as_c_str | biscuit::tokenized<c_comment>() ) { std::cout << boost::copy_range<std::string>(rng) << std::endl; }
Outputs:
/* c comment no.1 */ /* c comment no.2 */ /* c comment no.3 */
As token_range conforms to Forward Range, BOOST_FOREACH planned to be a member of Boost works properly.
capture and backref create Parser for the capturing. results_xxx [8] Algorithms are provided for accessing matching results after parsing:
match_results<std::string> caps; std::string text("abcxabcx"); BOOST_CHECK(( biscuit::results_match< seq< capture< 1, star_before< any, char_<'x'> > >, char_<'x'>, capture< 2, backref<1> >, char_<'x'> > >(text, caps) )); BOOST_CHECK( oven::equal(caps[1], "abc") ); boost::to_upper(caps[1]); BOOST_CHECK( oven::equal(text, "ABCxabcx") ); boost::to_upper(caps[2]); BOOST_CHECK( oven::equal(text, "ABCxABCx") );
match_results<r>, where r is a Parsing Range, conforms to Pair Associative Container and Unique Associative Container except for their constructor expressions, whose key_type is int and mapped_type is Parsing SubRange of r.
[8] | You may want to use match_results as User State, so the overloading was rejected. |
Biscuit doesn't make any assumption about value_type of Parsing Range, but non-type template parameters are so limited. What if value_type of Parsing Range is not char but Screamer type of your dungeon game? What if matching patterns are loaded on runtime for your mouse-gesture program? The only way is to extract values from User State.
valseq makes a sequential Parser from Value Functors:
template< int i > struct value_at { std::string& operator()(std::vector<std::string>& values) { return values.at(i); } }; void test_valseq() { using namespace boost::assign; std::vector<std::string> inputs; { inputs += "ghi", "abc"; } std::vector<std::string> values; { values += "abc", "def", "ghi"; } BOOST_CHECK(( biscuit::match< valseq< value_at<2>, value_at<0> > >(inputs, values) )); }
valset makes an alternation Parser from Value Functors:
std::vector<std::string> inputs0; { inputs0 += "abc"; } std::vector<std::string> inputs1; { inputs1 += "def"; } std::vector<std::string> inputs2; { inputs2 += "ghi"; } std::vector<std::string> values; { values += "abc", "def", "ghi"; } BOOST_CHECK(( biscuit::match< valset< value_at<0>, value_at<1>, value_at<2> > >(inputs0, values) )); BOOST_CHECK(( biscuit::match< valset< value_at<0>, value_at<1>, value_at<2> > >(inputs1, values) )); BOOST_CHECK(( biscuit::match< valset< value_at<0>, value_at<1>, value_at<2> > >(inputs2, values) ));
seq_range is more flexible than valseq. seq_range<V> makes a sequential Parser, where V is a Value Functor whose function call expression must be, whether reference or not, a Forward Range whose value_type is comparable with Parsing Range's:
struct pattern_loader { std::string& operator()(std::string& pattern) { pattern += "c"; return pattern; } }; void test_seq_range() { std::string pattern("ab"); BOOST_CHECK(( biscuit::match< seq< chseq<'x'>, seq_range<pattern_loader>, chseq<'y'> > >("xabcy"|oven::as_c_str, pattern) )); }
set_range is provided in a similar way:
std::string pattern("ab"); BOOST_CHECK(( biscuit::match< seq< chseq<'x'>, set_range<pattern_loader>, chseq<'y'> > >("xay"|oven::as_c_str, pattern) )); BOOST_CHECK(( biscuit::match< seq< chseq<'x'>, set_range<pattern_loader>, chseq<'y'> > >("xby"|oven::as_c_str, pattern) )); BOOST_CHECK(( biscuit::match< seq< chseq<'x'>, set_range<pattern_loader>, chseq<'y'> > >("xcy"|oven::as_c_str, pattern) ));
Biscuit emulates Boost.Spirit's debugging:
struct factor : debugger<factor, or_< integer, seq< chseq<'('>, expression, chseq<')'> >, actor< seq< chseq<'-'>, factor >, do_negate >, seq< chseq<'+'>, factor > > > { };
debugger Parser uses the type-name of the first argument for outputs which automatically disappear on release-compile.
Outputs:
1+2 struct calculator_debug::start: "1+2" struct calculator_debug::expression: "1+2" struct calculator_debug::term: "1+2" struct calculator_debug::factor: "1+2" struct calculator_debug::integer: "1+2" push 1 /struct calculator_debug::integer: "+2" /struct calculator_debug::factor: "+2" /struct calculator_debug::term: "+2" struct calculator_debug::term: "2" struct calculator_debug::factor: "2" struct calculator_debug::integer: "2" push 2 /struct calculator_debug::integer: "" /struct calculator_debug::factor: "" /struct calculator_debug::term: "" popped 1 and 2 from the stack. pushing 3 onto the stack. /struct calculator_debug::expression: "" /struct calculator_debug::start: "" ------------------------- Parsing succeeded result = 3 -------------------------
Biscuit emulates Boost.Spirit's Error Handling:
struct handler { template< class State, class UserState > error_status operator()(State&, UserState&, boost::sub_range<std::string> rng, int id) { BOOST_CHECK( id == 3 ); boost::to_lower(rng); std::cout << "exception caught...Test concluded successfully" << std::endl; return error_retry; } }; void test_error_handling() { typedef guard< seq< chseq<'a','b','c'>, expect< 3, chseq<'d'> > >, handler > start; std::string text("abcD"); BOOST_CHECK( biscuit::match<start>(text) ); BOOST_CHECK( oven::equal(text, "abcd") ); }
Semantic Actions aside, the exception-safety of Biscuit depends on Parsing Range. In turn, if operators of the Parsing Range provide the basic exception-safety, Biscuit provides it. If the operators provide the strong exception-safety, Biscuit provides it.
Oven is the Boost.Range extension library. It provides some predefined ranges for Biscuit:
#include <pstade/oven.hpp> // ... BOOST_FOREACH (int i, oven::count_from_to(1, argc)) { BOOST_CHECK(( biscuit::match<xml_grammar::start>( oven::file_range<boost::uint8_t>(argv[i]) | oven::utf8_decoded<boost::uint32_t>() ) )); }
YARD and Biscuit are the examples of "composing inlined algorithms" that Boost.MPL TODO list shows. Biscuit parsers are expression templates that are made by hand, which tend to make smaller executables. Boost.Spirit and Boost.Xpressive automatically create expression templates by using operator overloads. In my opinion, such overloads couldn't increase readability as expected.