Блог*

#prog #parsing

libphonenumber — библиотека для парсинга и форматирования телефонных номеров. Почему это необходимо? Ну, потому что телефонные номера — запутанная вещь.

GitHub

GitHub - google/libphonenumber: Google's common Java, C++ and JavaScript library for parsing, formatting, and validating international…

Google's common Java, C++ and JavaScript library for parsing, formatting, and validating international phone numbers. - google/libphonenumber

1.11K viewsedited 10:19

Блог*

#prog #regex #parsing #haskell

Fix-ing regular expressions

TL;DR: We add variables, let bindings, and explicit recursion via fixed points to classic regular expressions. It turns out that the resulting explicitly recursive, finitely described languages are well suited for analysis and introspection.

Если читали мои претензии к регулярным выражениям, то наверняка помните, что одной из моих претензий было отсутствие декомпозируемости. В статье это решается за счёт добавления явных переменных (точнее, привязок в форме let x = r in re). Также добавление явного fix расширяет возможности языка и даёт возможность использовать рекурсию. На практике это означает, что подобное решение в состоянии распарсить известный пример грамматики, являющейся контекстно-зависимой: aⁿbⁿ.

К сожалению, не обошлось без ложки дёгтя: даже с некоторыми оптимизациями производительность итогового решения оставляет желать лучшего — что, в принципе, вполне ожидаемо, учитывая, что распознавание использует подход regular expression derivatives, который при применении в лоб приводит к сильному распуханию промежуточных структур данных. А ещё лично меня несколько расстраивает использование в финальной версии fix и полиморфной рекурсии — это означает, что итоговый результат не переносим в лоб на другие ЯП.

👎2👍1

1.19K viewsedited 21:30

Блог*

#prog #parsing #article

Obliteratingly Fast Parser Combinators (pdf)

Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity, but harmful for performance.
We show how to fuse together separately-defined lexers and parsers, drastically improving performance without compromising modularity. Our staged parser combinator library, flap, provides a standard parser combinator interface, but generates psecialized token-free code that runs several times faster than ocamlyacc on a range of benchmarks.

(thanks @clayrat)

Blogspot

Obliteratingly Fast Parser Combinators

Jeremy Yallop , Ningning Xie , and I have a new draft out about combinator parsing, entitled Fusing Lexing and Parsing . The abstract is sh...

🔥6

1.66K viewsedited 20:29

Блог*

#prog #parsing #article

Why you should not use (f)lex, yacc and bison

Strumenta

Why you should not use (f)lex, yacc and bison - Strumenta

Flex and Bison works, but every day they become less of a good choice. There are much better options and we will explain why.

👍1

1.01K views18:46

Блог*

#prog #parsing #article

Faster general parsing through context-free memoization (pdf, thanks @zukaboo)

(thanks @grammarware, @GabrielFallen)

We present a novel parsing algorithm for all context-free languages. The algorithm features a clean mathematical formulation: parsing is expressed as a series of standard operations on regular languages and relations. Parsing complexity w.r.t. input length matches the state of the art: it is worst-case cubic, quadratic for unambiguous grammars, and linear for LR-regular grammars. What distinguishes our approach is that parsing can be implemented using only immutable, acyclic data structures. We also propose a parsing optimization technique called context-free memoization. It allows handling an overwhelming majority of input symbols using a simple stack and a lookup table, similarly to the operation of a deterministic LR(1) parser. This allows our proof-of-concept implementation to outperform the best current implementations of common generalized parsing algorithms (Earley, GLR, and GLL). Tested on a large Java source corpus, parsing is 3–5 times faster, while recognition—35 times faster.

Честно скажу, я прочитал по диагонали, но фактически алгоритм выглядит, как parsing with derivatives, но сделанный правильно.

ACM Conferences

Faster general parsing through context-free memoization | Proceedings of the 41st ACM SIGPLAN Conference on Programming Language…

1.66K viewsedited 19:57

Блог*

#prog #parsing #article

Parser generators vs. handwritten parsers: surveying major language implementations in 2021

Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants. But do any modern programming languages actually use parser generators anymore?

To find out, this post presents a non-definitive survey of the parsing techniques used by various major programming language implementations.

👎3❤2

1.1K views18:49

Блог*

#prog #parsing #rust #article

Practical Parsing in Rust with nom

Misaligned Reward Function

Practical Parsing in Rust with nom

Parsing doesn't have to be complicated. In this tutorial, I'll demonstrate nom's simple functional approach, making parsers easy for practical use cases.

689 views19:15

Блог*

#prog #parsing #rust #rustlib

Chumsky

A parser library for humans with powerful error recovery.

Пример парсера brainfuck:

use chumsky::prelude::*;

#[derive(Clone)]
enum Instr {
    Left, Right,
    Incr, Decr,
    Read, Write,
    Loop(Vec<Self>),
}

fn parser() -> impl Parser<char, Vec<Instr>, Error = Simple<char>> {
    recursive(|bf| {
        choice((
            just('<').to(Instr::Left),
            just('>').to(Instr::Right),
            just('+').to(Instr::Incr),
            just('-').to(Instr::Decr),
            just(',').to(Instr::Read),
            just('.').to(Instr::Write),
            bf.delimited_by(just('['), just(']')).map(Instr::Loop),
        ))
        .repeated()
    })
}

GitHub

GitHub - zesterer/chumsky: Write expressive, high-performance parsers with ease.

Write expressive, high-performance parsers with ease. - zesterer/chumsky

🔥12

666 views19:34

Блог*

#prog #parsing #rust #article

Parsing numbers into base-10 decimals with SIMD

When parsing number-dense JSON, much of the overhead is parsing numbers instead of parsing the JSON itself. For the typical case, most of the overhead within number parsing is converting into a form mantissa * 10^-exponent, not transforming into a proper floating point value.

Looking through high-quality number parsers such as rust_decimal’s or simd_json’s, they tend to avoid SIMD or only use it after ensuring there’s a long enough series of decimal characters. They also don’t support parsing in batches and concern themselves with problems like

* How do I avoid reading off the end of the number?
* How can I give precise error messages?
* How do I handle extremely long decimals?
* How do I handle separators?

In realistic parses, most of the above never apply! Invalid data is rare, long numbers are rare, weird separators are rare, and most numbers aren’t at the end of a buffer (with padding, none are). We can relax the above constraints and get something friendlier to a SIMD implementation. First, we assume:

1. All inputs have an extra 16 bytes of data off the end
2. Precise error reporting is not required
3. Inputs are not longer than 16 bytes
4. Inputs only contain digits and potentially 1 ‘.’
5. The sign is already processed (it’s just a simple check and pointer adjustment)
6. Optionally, we can parse many numbers at once

While these restrictions appear limiting, they’ll handle almost all actual cases, and a fallback can take the rest. With these considerations, we can go about writing a vectorized parser.

<...>

The elephant in the room is figuring out if this optimized parser is faster in real-world settings. It’s one thing to be faster in a very tight microbenchmark loop; replicating performance improvements in production is more challenging. Microbenchmarks can exacerbate coincidental interactions with microarchitecture that wouldn’t be relevant in the real world. Especially in parsing, there’s lots of complex, branchy, unpredictable surrounding code that complicates an otherwise clean sequence of instructions.

I’m experimenting with adding this to the rust port of simd-json, and adding unsafe fast-path parsers to rust_decimal. Hopefully we’ll see the desired real-world gains!

vgatherps.github.io

Parsing numbers into base-10 decimals with SIMD

Using SIMD to accelerate decimal parsing

🔥1

632 viewsedited 23:03

Блог*

#prog #python #parsing #article

Now You Have Three Problems (перевод)

О реализации парсер-комбинаторов на Python и о том, какие неочевидные премущества они могут нести

👍10💩1

1.13K views19:12

Блог*

#prog #rust #parsing #article

Resilient LL Parsing Tutorial

In this tutorial, I will explain a particular approach to parsing, which gracefully handles syntax errors and is thus suitable for language servers, which, by their nature, have to handle incomplete and invalid code. Explaining the problem and the solution requires somewhat less than a trivial worked example, and I want to share a couple of tricks not directly related to resilience, so the tutorial builds a full, self-contained parser, instead of explaining abstractly just the resilience.

The tutorial is descriptive, rather than prescriptive — it tells you what you can do, not what you should do.

* If you are looking into building a production grade language server, treat it as a library of ideas, not as a blueprint.
* If you want to get something working quickly, I think today the best answer is “just use Tree-sitter”, so you’d better read its docs rather than this tutorial.
* If you are building an IDE-grade parser from scratch, then techniques presented here might be directly applicable.

matklad.github.io

Resilient LL Parsing Tutorial

In this tutorial, I will explain a particular approach to parsing, which gracefully handles syntax errors and is thus suitable for language servers, which, by their nature, have to handle incomplete and invalid code.
Explaining the problem and the solution…

🔥2🤔1🖕1

836 views20:51

Блог*

#prog #parsing #article

В Just write a parser (примечание к небольшому набору уроков по реализации парсера) автор выступает против глубокого изучения подходов к парсингу в курсах по обучению компиляторостроения и за то, чтобы забить на "академические" подходы и делать рекурсивный спуск руками. В качестве аргументов он приводит:

* Рекурсивный спуск очень хорошо даёт понять, как работает парсер, в отличие от генераторов парсеров.
* Это практичный навык: парсеры реальных компиляторов написаны вручную.
* Это не так уж и сложно.

В статье Why We Need to Know LR and Recursive Descent Parsing Techniques Laurence Tratt соглашается с первым тезисом (не перегружать курс по компиляторам подходами к парсингу), но оспаривает второй.

Именно, он указывает на LR-грамматики: такие грамматики точно являются однозначными, а то, является ли грамматика LR, можно проверить автоматически. Рекурсивный спуск же не даёт ясно понять, какова реализуемая парсером грамматика. В частности, реализуемая парсером грамматика может быть неоднозначной. Более того, то, что самописный рекурсивный спуск сам по себе вполне детерминирован, означает, что парсер таки разрешает неоднозначности грамматики, причём зачастую неочевидным способом.

Автор проводит параллель между валидатором грамматики и тайпчекером: прохождение проверки означает гарантию отсутствия определённых проблем (неоднозначностей грамматики и ошибок несовпадения типов соответственно). Более того, как он отмечает:

> If writing a grammar in LR style is hard for you as the language author, it will almost certainly be hard for users to understand!

Разумеется, автор признаёт, что не для всех языков возможно написать LR-грамматику в принципе, но он отмечает, что попытка написать такую грамматику может как минимум подсказать, где в реализации парсера могут крыться баги из-за неоднозначности.

Разумеется, это не всё, что есть в статье, так что советую прочитать целиком.

tiarkrompf.github.io

Tiark's Notebook

Tiark Rompf is an Associate Professor at Purdue University. Notes and blog posts on programming, research, CS, software systems.

👍11❤1🔥1

1.22K views22:37

Блог*

#prog #parsing #article

LL and LR in Context: Why Parsing Tools Are Hard

blog.reverberate.org

LL and LR in Context: Why Parsing Tools Are Hard

Parsing, performance, and low-level programming.

🤔4🌚1

1.29K views18:19

About

Blog

Apps

Platform