Crate pomelo_hir
source ·Expand description
High-level intermediate representation (HIR) for pomelo
.
Similar to the HIR in Rust, the HIR mainly
serves to desugar some derived language constructs to simplify semantic analysis.
The lowering from the AST to HIR is implemented in lower
.
Structure of the HIR
The main entry point to the HIR is File
, which contains the list of top level
declarations (Dec
) in the file.
Internally, the HIR is represented as a tree in an index-based arena – the idea is that this is
hopefully a little simpler to work with than the
rowan
-based AST that is outputted at the parsing stage.
HIR nodes (notably Dec
, Expr
, Pat
, and Ty
) are stored by a FileArena
.
The syntactic kinds of nodes are represented by DecKind
, ExprKind
, PatKind
, and
TyKind
.
References between nodes are represented as indexes (Idx<T>
) to nodes in
the arena, which conveniently means that we don’t have to worry about ownership problems (see
later notes about the tradeoff here).
Basic semantic information is tracked and embedded in the HIR during lowering from the AST.
For example, variable expressions are represented by an ExprKind::Vid { longvid: (LongVId, DefLoc), .. }
, where LongVId
represents an identifier and DefLoc
is a reference to the
declaration or pattern where the identifier is declared.
Type constructors (TyCon
) similarly come with a reference to their declaration.
(Note: this needs to be tested a lot more!)
The symbol table is held in the LoweringCtxt
, accessible through calling
LoweringCtxt::resolver
.
Besides being used to annotate the HIR with name resolution data, we also use this to distinguish
between infix expressions vs function applications.
This is not done in the parsing stage because it requires tracking which identifiers have been
declared infix
/infixr
/nonfix
.
The implementation uses Pratt parsing and is located in the lower::infix
module
(also done to distinguish infix vs constructed patterns).
The HIR does not currently have a way to go from a declaration to all of its references, which
is something that will be important for common IDE actions.
This could be baked in at the lowering stage as well by annotating variable declarations with
a list of usages.
For now, this is implemented separately in pomelo_ty
.
Lowering strategy and tradeoffs
Currently, the HIR is lowered in one pass and is contained all in a single graph. This is nice and simple, and also makes sense because the declarations in an SML file are sequential (in contrast to, e.g., a Rust module where the items may appear in any order).
Comparison to rustc
and rust-analyzer
I was confused about this for a while, so it might be worth writing a little more (at least
to help me remember why I did it this way).
The monolithic structure of the HIR here is very different to how things are done in
rustc
and rust-analyzer
, the two compilers I’ve spent the most time looking at for inspiration.
Both rustc
and rust-analyzer
make a strong conceptual division between Item
s and Body
s.
An Item
is anything that can appear at the
top level of a module (fn
, struct
, enum
, const
, mod
, etc.), while a Body
is the
stuff inside of the Item
that might need to be type checked (i.e., expressions live inside of bodies).
As mentioned previously, the order of Items
in a file doesn’t affect the meaning of a
Rust program (unlike in SML).
For example, we might have the following Rust function declaration in a module,
pub fn foo(x: Bar) -> Baz {
let y = 0;
let z = "something";
Baz::new()
}
The Item
here is the fn
declaration, which we can lookup to resolve the name foo
,
as well as lookup the functions signature, etc.
Meanwhile, the Body
is all the stuff inside.
Note that Item
s define names that are visible within the entire module,
while a declaration in a Body
s only has scope within that Body
.
In addition, we need to run type inference and checking on the contents of the Body
,
but not on Item
’s signature.
Importantly, changing the Body
of one Item
cannot impact the type-checking of the Body
of
a different Item
.
This all means that there is a pretty natural split between item declarations and expressions
in Rust.
If separate Item
s are stored independently, then we take advantage of the fact that typing
within one Body
doesn’t effect another Body
, and we don’t have to completely reanalyze the file
every time it is edited.
However, this actually doesn’t make too much sense for SML, since changing the order of top-level declarations in a file can substantially change the program’s meaning. At the very least, we probably have to redo type-checking on the entire file below the point where it was edited, so it’s probably not too much extra cost to redo the parsing and lowering of the entire file to HIR.
A note about immutability
Currently, this only covers the pure functional subset of Core SML.
This is convenient because immutability makes it easy to have the HIR be static single assignment (SSA).
Each variable usage (syntactically a VId
) is annotated with a DefLoc
that points to its
declaration.
The combination (VId, DefLoc)
is a unique identifier that points to only one place in the
HIR.
If we ever try to extend this to include imperative features, it will be interesting to see how
challenging it is to adapt the HIR.
TODO:
Handle errors during HIR lowering. Figure out a good way to surface them.
Figure out if validation passes should be run on HIR or AST.
Figure out if each HIR node needs to hold the index of its parent node.
Figure out logging for this stage and probably parsing too.