
slow-learner, python type inference tool

nov 2022
links: github PyPI

there are multiple tools to infer JSON Schema spec from an example or a set of examples of JSON documents. however,

so i made my own thing, a library + CLI that consumes a stream of values (for CLI — JSON documents) and generates python code to describe it. the usage is:

# fetch some example JSON data, e.g. GitHub releases list
gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/facebook/react/releases --paginate > data.json

# run the cli
slow-learner learn --spread --type-name Release data.json 

this generates the following code (blacked for readability):

"""
This file contains Python 3.8+ type definitions generated by TypeLearner from 99 observed value(s)

Source JSON files:
- /Users/njvh/Documents/Personal/slow-learner/data.json
"""

from typing import List
from typing import Literal
from typing_extensions import NotRequired
from typing import Optional
from typing import TypedDict
from typing import Union


class ReleaseAuthor(TypedDict):
    login: str
    id: int
    node_id: str
    avatar_url: str
    gravatar_id: Literal[""]
    url: str
    html_url: str
    followers_url: str
    following_url: str
    gists_url: str
    starred_url: str
    subscriptions_url: str
    organizations_url: str
    repos_url: str
    events_url: str
    received_events_url: str
    type: Literal["User"]
    site_admin: Literal[False]


class ReleaseAssetsItemUploader(TypedDict):
    login: str
    id: int
    node_id: str
    avatar_url: str
    gravatar_id: Literal[""]
    url: str
    html_url: str
    followers_url: str
    following_url: str
    gists_url: str
    starred_url: str
    subscriptions_url: str
    organizations_url: str
    repos_url: str
    events_url: str
    received_events_url: str
    type: Literal["User"]
    site_admin: Literal[False]


class ReleaseAssetsItem(TypedDict):
    url: str
    id: int
    node_id: str
    name: str
    label: Optional[str]
    uploader: ReleaseAssetsItemUploader
    content_type: Union[
        Literal["text/javascript"],
        Literal["application/javascript"],
        Literal["application/x-javascript"],
        Literal["application/zip"],
    ]
    state: Literal["uploaded"]
    size: int
    download_count: int
    created_at: str
    updated_at: str
    browser_download_url: str


ReleaseReactions = TypedDict(
    "ReleaseReactions",
    {
        "url": str,
        "total_count": int,
        "+1": int,
        "-1": Literal[0],
        "laugh": int,
        "hooray": int,
        "confused": Literal[0],
        "heart": int,
        "rocket": int,
        "eyes": int,
    },
)


class Release(TypedDict):
    url: str
    assets_url: str
    upload_url: str
    html_url: str
    id: int
    author: ReleaseAuthor
    node_id: str
    tag_name: str
    target_commitish: str
    name: str
    draft: Literal[False]
    prerelease: bool
    created_at: str
    published_at: str
    assets: List[ReleaseAssetsItem]
    tarball_url: str
    zipball_url: str
    body: str
    reactions: NotRequired[ReleaseReactions]

it learns structured data in the form of TypedDict (handling nasty cases of keys not being valid python identifiers), and recurses into complex data structures. but my favorite feature is that it learns Literal types for fields where not too many (10 by default) distinct values were found. in this example that produces some false positives (e.g. Literal[False] where the type really should be bool), but those are trivial to fix by hand.
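
to make the two ideas above concrete, here is a minimal sketch of how they could work. this is not slow-learner's actual code: the names infer_annotation, render_typed_dict and MAX_LITERAL_VALUES are hypothetical, it only handles scalar values, and it returns annotations as plain strings.

```python
import keyword
from typing import Any, Dict, Iterable, List

# hypothetical constant; mirrors the "10 by default" threshold described above
MAX_LITERAL_VALUES = 10


def infer_annotation(values: Iterable[Any], max_literals: int = MAX_LITERAL_VALUES) -> str:
    """Return an annotation string for the scalar values observed in one field."""
    distinct: List[Any] = []
    type_names: List[str] = []
    for value in values:
        # compare type as well as value, so that True and 1 stay separate
        if not any(type(value) is type(seen) and value == seen for seen in distinct):
            distinct.append(value)
        name = type(value).__name__
        if name not in type_names:
            type_names.append(name)
    # few distinct values: describe the field by the values themselves
    if 0 < len(distinct) <= max_literals:
        return "Literal[" + ", ".join(repr(v) for v in distinct) + "]"
    # too many distinct values: fall back to the observed type(s)
    if len(type_names) == 1:
        return type_names[0]
    return "Union[" + ", ".join(type_names) + "]"


def render_typed_dict(name: str, fields: Dict[str, str]) -> str:
    """Render a TypedDict, using the functional syntax if any key is not a valid identifier."""
    plain = all(k.isidentifier() and not keyword.iskeyword(k) for k in fields)
    if plain:
        body = "\n".join(f"    {k}: {t}" for k, t in fields.items()) or "    pass"
        return f"class {name}(TypedDict):\n{body}"
    # keys like "+1" cannot be class attributes, so emit TypedDict("Name", {...})
    body = "\n".join(f"        {k!r}: {t}," for k, t in fields.items())
    return f'{name} = TypedDict(\n    "{name}",\n    {{\n{body}\n    }},\n)'
```

the sketch also shows where the false positives come from: a field that was only ever observed as False becomes Literal[False], because nothing in the sample says True is possible.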

i use this library whenever i need to work with new JSON data for longer than 5 minutes, and my life has improved dramatically. oh, and the name is a reference to this cool song.

python library