Crate pomelo_hir

Expand description

High-level intermediate representation (HIR) for pomelo.

Similar to the HIR in Rust, the HIR mainly serves to desugar some derived language constructs to simplify semantic analysis. The lowering from the AST to HIR is implemented in lower.

Structure of the HIR

The main entry point to the HIR is File, which contains the list of top level declarations (Dec) in the file. Internally, the HIR is represented as a tree in an index-based arena – the idea is that this is hopefully a little simpler to work with than the rowan-based AST that is outputted at the parsing stage. HIR nodes (notably Dec, Expr, Pat, and Ty) are stored by a FileArena. The syntactic kinds of nodes are represented by DecKind, ExprKind, PatKind, and TyKind. References between nodes are represented as indexes (Idx<T>) to nodes in the arena, which conveniently means that we don’t have to worry about ownership problems (see later notes about the tradeoff here).

Basic semantic information is tracked and embedded in the HIR during lowering from the AST. For example, variable expressions are represented by an ExprKind::Vid { longvid: (LongVId, DefLoc), .. }, where LongVId represents an identifier and DefLoc is a reference to the declaration or pattern where the identifier is declared. Type constructors (TyCon) similarly come with a reference to their declaration. (Note: this needs to be tested a lot more!)

The symbol table is held in the LoweringCtxt, accessible through calling LoweringCtxt::resolver. Besides being used to annotate the HIR with name resolution data, we also use this to distinguish between infix expressions vs function applications. This is not done in the parsing stage because it requires tracking which identifiers have been declared infix/infixr/nonfix. The implementation uses Pratt parsing and is located in the lower::infix module (also done to distinguish infix vs constructed patterns).

The HIR does not currently have a way to go from a declaration to all of its references, which is something that will be important for common IDE actions. This could be baked in at the lowering stage as well by annotating variable declarations with a list of usages. For now, this is implemented separately in pomelo_ty.

Lowering strategy and tradeoffs

Currently, the HIR is lowered in one pass and is contained all in a single graph. This is nice and simple, and also makes sense because the declarations in an SML file are sequential (in contrast to, e.g., a Rust module where the items may appear in any order).

Comparison to `rustc` and `rust-analyzer`

I was confused about this for a while, so it might be worth writing a little more (at least to help me remember why I did it this way). The monolithic structure of the HIR here is very different to how things are done in rustc and rust-analyzer, the two compilers I’ve spent the most time looking at for inspiration. Both rustc and rust-analyzer make a strong conceptual division between Items and Bodys. An Item is anything that can appear at the top level of a module (fn, struct, enum, const, mod, etc.), while a Body is the stuff inside of the Item that might need to be type checked (i.e., expressions live inside of bodies). As mentioned previously, the order of Items in a file doesn’t affect the meaning of a Rust program (unlike in SML).

For example, we might have the following Rust function declaration in a module,

pub fn foo(x: Bar) -> Baz {
    let y = 0;
    let z = "something";
    Baz::new()
}

The Item here is the fn declaration, which we can lookup to resolve the name foo, as well as lookup the functions signature, etc.

Meanwhile, the Body is all the stuff inside. Note that Items define names that are visible within the entire module, while a declaration in a Bodys only has scope within that Body. In addition, we need to run type inference and checking on the contents of the Body, but not on Item’s signature. Importantly, changing the Body of one Item cannot impact the type-checking of the Body of a different Item.

This all means that there is a pretty natural split between item declarations and expressions in Rust. If separate Items are stored independently, then we take advantage of the fact that typing within one Body doesn’t effect another Body, and we don’t have to completely reanalyze the file every time it is edited.

However, this actually doesn’t make too much sense for SML, since changing the order of top-level declarations in a file can substantially change the program’s meaning. At the very least, we probably have to redo type-checking on the entire file below the point where it was edited, so it’s probably not too much extra cost to redo the parsing and lowering of the entire file to HIR.

A note about immutability

Currently, this only covers the pure functional subset of Core SML. This is convenient because immutability makes it easy to have the HIR be static single assignment (SSA). Each variable usage (syntactically a VId) is annotated with a DefLoc that points to its declaration. The combination (VId, DefLoc) is a unique identifier that points to only one place in the HIR. If we ever try to extend this to include imperative features, it will be interesting to see how challenging it is to adapt the HIR.