# IPLD Selectors

This document is a designdoc for IPLD Selectors.

# [Meta: Status of this doc]

  • This was written around 2018-10-16 (video presentation (opens new window))
  • It narrows down the decision space enough to make significant progress.
  • good enough for trying out an implementation to learn more and make choices.
  • But it is not complete.
  • some choices that need to be made:
    • [ ] select general binary and string format structure for selectors. (options given here)
    • [ ] binary and string formats for each selector. doesn't have to be here.
    • [ ] whether to dump all selector codes into multicodec table, or one code.
    • [ ] which S expression selector variant to use (may be out of scope for this doc)
  • more prose here may help implementors.

# Motivation - what are Selectors?

Prerequisites: IPLD (opens new window), IPLD data model, CID (opens new window).

IPLD Selectors are expressions that identify ("select") a subset of nodes in an IPLD dag.

This is a useful primitive to use along with: (a) systems that require distributing or pinning dags (IPFS, Filecoin, bitswap, graphsync, ipfs-cluster), (b) applications that require fetching subsets of data in specific orders or at specific times (video players, dataset viewers, file systems), (c) programs that transform graphs into other graphs (data transformations, ETL, etc). In short, it is a fundamental primitive required by most systems and applications in the IPLD and IPFS ecosystems, as important as multihash (opens new window), CIDs (opens new window), IPLD Formats (opens new window), and more.

# Prior Work

There have been a lot of articulations about selectors in the history of the IPFS and IPLD projects. Many documents exist extolling all the kinds of use cases for selectors. Instead of giving a complete articulation here, this document will mention a few use cases and link to the other documents.

# Important Design Notes

Learn from the best. There have been dozens of graph selector systems implemented over the last few decades. There have been only a few that have been extremely successful, and productive. Study and learn from these systems. These include: unix globs, regular expressions, XPath, css selectors, and more.

Selectors include paths to the root. In most cases using IPLD Selectors, programs need to be able to verify authenticated data structures (they need to hash all the nodes in a path, all the way to the root). Therefore, for all these distributed, authenticated use cases, it is drastically easier to count on Selectors that yield a result that is verifiable (checkable against the original query -- the root).

Self-describe and use multiple types. Over the years, there has been much designing, without arriving at a complete, perfect solution. We have now recognized this as a feature where self-description and upgradability fit better than forcing a single language: there should just be many types of selectors and applications should use the ones that fit their needs. Due to the wide variety of use cases, we will not have a single perfect syntax. This will become easier over time, as systems that use IPLD acquire the ability to execute authenticated code.

Aim for succinct, intuitive, human-readable, expressive power. The most successful selector systems (unix globs, regular expressions, css selectors, etc.) have created very poweful and succinct expressions, that balance the nuance between making something human-friendly and highly efficient. Understandability, intuitiveness, expressivity, familiarity, and similar qualities are desired in the human-readable syntax. Self-description and types make this drastically easier, enabling whole new types to be made over time as better and better syntaxes are discovered, and even existing familiar syntaxes to be ported over.

Aim for succinct, efficient, self-describing binary representations. Various systems that use IPLD Selectors may need to distribute and store billions of selectors. Succinctness of expression is key. Self-description and types make this drastically easier, enabling whole new types to be made over time as more efficient syntaxes are discovered.

Blocks have CIDs. IPLD Nodes have Paths. It is very important that we align on the following two statements: (1) Blocks have CIDs. (2) IPLD Nodes have paths, not CIDs. This is important because it means that IPLD nodes may begin within the middle of a block, and may not be addressable directly by a CID. This is a key realization, as Selectors must always support Paths as a dag root, not just a CID.

# IPLD Selector System

# Narrowing down the problem

The problem of IPLD Selectors is narrowed down to a very concrete, simple problem:

  • How can we succinctly identify a dag subset, connected to the root?

These other questions are solved in other systems, above or below IPLD Selectors:

  • How can we represent arbitrary data structures or files in dags? (IPLD data model, user code)
  • How can we succinctly identify a dag subset, not connected to the root (User code on top of IPLD selectors)
  • How can we traverse a dag subset? (IPLD library implementations, using IPLD Selectors)
  • How can we make subset selection efficient? (IPLD library implementations)
  • How can we make selector combination efficient? (IPLD library implementations)
  • How can we strip a dag subset from links to objects not in the subset (IPLD Transformations, user code)
  • How can we convert one dag representation into another? (IPLD Transformations, user code)
  • How can we agree with multiple other parties who will pin or send subets of a dag? (ipfs-cluster, graphsync)

With our problem narrowed down, we can focus on solving just that, and let all other concerns be solved elsewhere in the stack.

# Selector Requirements

  • P0 - Selectors can express any valid dag subset (connected to the root).
  • P0 - Selectors can select based on IPLD paths.
  • P1 - Selectors can select based on values.
  • P2 - Selectors can select with seeded pseudorandomness.
  • P3 - Selectors can select based on sibling nodes (as in css3) and their descendants (as in css4).
  • P1 - Selectors can be composed.
  • P0 - Once written, selectors should work permanently. (selector code should not change under the users).
  • P1 - Selector languages can evolve and improve over time (faster than other selector systems).
  • P3 - Allow experimentation without requiring backward-compatible syntax (different from globs and css).
  • P0 - There is a 1-1 mapping between human readable and binary syntaxes.
  • P0 - Binary syntax is self-described, succinct, and efficient.
  • P0 - Human readable syntax is powerful and expressive.
  • P1 - Human readable syntax is intuitive and familiar.

Some of these may seem mutually exclusive. They are not.

# Approach: Selector Types

The requirements stated above are hard to meet. We have spent lots of time in the last few years trying to reconcile them into one language and syntax, with no success. Earlier this year (2018) we recognized that the solution to this problem should be flexibility and interoperability: let many selector languages and syntaxes flourish, and let them evolve over time. This would allow us to satisfy all constraints above, including both a permanent model that can also improve over time. And it reduces the core of our system into three components:

  • (1) A system of selector types, that allows creating new selector languages and syntaxes, and can compose them.
  • (2) An easy path for plugging selector types into IPLD libraries and other consumers of IPLD Selectors.
  • (3) A light process for testing and admitting new selector types into standard IPLD libraries.

These components imply or expand into the following things:

  • Well-defined binary and human-readable type self-description (codes in multicodec).
  • A narrow Selector interface for most uses of selectors, agnostic to selector type.
  • A standard way to add selector type implementations to IPLD libraries
  • IPLD libraries that pervasively uses the abstract Selector type, and can plug in concrete types.
  • A few simple selector types that cover most common cases (cid, path, glob, ...)
  • A selector type to allow composing selectors (MultiSelector)
  • Aim for language independent implementations of selector implementations (parsers, execution, etc). (WASM?)
  • Allow language-specific implementations of selectors (parsers, execution, etc).
  • (IMPORTANT) Well-designed set of test vectors representing a variety of use cases for IPLD Selectors.
  • A recommended structure for implementing a selector type, with an easy to use test suite.

# Additional iterations

Text above is design history from some of the earliest drafts of selectors (as indicated by the inline dates!).

There have been several iterations on this work:

  • The first commit recognizable as this document is: 920f671fe388cc401caf32234d2de98eed0cb9b7 , with comments at https://github.com/ipld/specs/pull/75 .
    • In parallel, note contemporary discussions https://github.com/ipld/specs/pull/66 and https://github.com/ipld/specs/pull/78 .
  • Subsequent commits bb106c0a83a809be07e0b798a87e2437136a62d3 and 0b99488e7c6df91c305410fd4d8c63185fe9d509 introduced the vision of a unified IPLD-native "AST" (with the idea of user-facing ergonomics being address by having multiple future DSLs mapping into that single well-specified AST).
  • Several more evolutions were contained in https://github.com/ipld/specs/pull/116 .
  • Several more evolutions were contained in https://github.com/ipld/specs/pull/133 .
  • Finally, in https://github.com/ipld/specs/pull/142 , we're checkpointing this piece of design history into the "./design-history" folder (it's still good reference for the original goals and to trace some of the problem-solving!); and editing the tip of the selectors docs to emphasize succinctness, and to focus on the solutions rather than the journey we took to get there.