there are multiple tools to infer JSON Schema spec from an example or a set of examples of JSON documents. however,
so i made my own thing: a library + CLI that consumes a stream of values (JSON documents, in the CLI's case) and generates python code describing them. the usage is:
# fetch some example JSON data, e.g. GitHub releases list
gh api \
  -H "Accept: application/vnd.github+json" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  /repos/facebook/react/releases --paginate > data.json
# run the cli
slow-learner learn --spread --type-name Release data.json
this generates the following code (formatted with black for readability):
"""
This file contains Python 3.8+ type definitions generated by TypeLearner from 99 observed value(s)
Source JSON files:
- /Users/njvh/Documents/Personal/slow-learner/data.json
"""
from typing import List
from typing import Literal
from typing_extensions import NotRequired
from typing import Optional
from typing import TypedDict
from typing import Union
class ReleaseAuthor(TypedDict):
    login: str
    id: int
    node_id: str
    avatar_url: str
    gravatar_id: Literal[""]
    url: str
    html_url: str
    followers_url: str
    following_url: str
    gists_url: str
    starred_url: str
    subscriptions_url: str
    organizations_url: str
    repos_url: str
    events_url: str
    received_events_url: str
    type: Literal["User"]
    site_admin: Literal[False]
class ReleaseAssetsItemUploader(TypedDict):
    login: str
    id: int
    node_id: str
    avatar_url: str
    gravatar_id: Literal[""]
    url: str
    html_url: str
    followers_url: str
    following_url: str
    gists_url: str
    starred_url: str
    subscriptions_url: str
    organizations_url: str
    repos_url: str
    events_url: str
    received_events_url: str
    type: Literal["User"]
    site_admin: Literal[False]
class ReleaseAssetsItem(TypedDict):
    url: str
    id: int
    node_id: str
    name: str
    label: Optional[str]
    uploader: ReleaseAssetsItemUploader
    content_type: Union[
        Literal["text/javascript"],
        Literal["application/javascript"],
        Literal["application/x-javascript"],
        Literal["application/zip"],
    ]
    state: Literal["uploaded"]
    size: int
    download_count: int
    created_at: str
    updated_at: str
    browser_download_url: str
ReleaseReactions = TypedDict(
    "ReleaseReactions",
    {
        "url": str,
        "total_count": int,
        "+1": int,
        "-1": Literal[0],
        "laugh": int,
        "hooray": int,
        "confused": Literal[0],
        "heart": int,
        "rocket": int,
        "eyes": int,
    },
)
class Release(TypedDict):
    url: str
    assets_url: str
    upload_url: str
    html_url: str
    id: int
    author: ReleaseAuthor
    node_id: str
    tag_name: str
    target_commitish: str
    name: str
    draft: Literal[False]
    prerelease: bool
    created_at: str
    published_at: str
    assets: List[ReleaseAssetsItem]
    tarball_url: str
    zipball_url: str
    body: str
    reactions: NotRequired[ReleaseReactions]
it learns structured data in the form of TypedDict (handling nasty cases of keys not being valid python identifiers), and recurses into complex data structures. but my favorite feature is that it learns Literal types for fields where not too many (10 by default) distinct values were found. in this example that produces some false positives (e.g. Literal[False] where the type really should be bool), but those are trivial to fix by hand.
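to make those two ideas concrete, here is a minimal sketch of how they could work. this is not slow-learner's actual code: `infer_field_type`, `render_typed_dict` and `MAX_LITERALS` are hypothetical names made up for illustration, the sketch only handles flat hashable values (str, int, bool, None), and treating "not too many" as "at most 10" is an assumption.

```python
import keyword
from typing import Any, Dict, Iterable

# hypothetical threshold, mirroring the "10 by default" mentioned above
MAX_LITERALS = 10


def infer_field_type(values: Iterable[Any], max_literals: int = MAX_LITERALS) -> str:
    """Return a type expression (as source text) for the observed values."""
    # key on (type, value) so that False and 0 are kept as separate values
    distinct = list({(type(v), v): v for v in values}.values())
    if not distinct:
        raise ValueError("need at least one observed value")
    if len(distinct) <= max_literals:
        # few distinct values: describe the field by the values themselves
        return "Literal[" + ", ".join(repr(v) for v in distinct) + "]"
    # too many distinct values: fall back to the plain type(s)
    names = sorted({"None" if v is None else type(v).__name__ for v in distinct})
    return names[0] if len(names) == 1 else "Union[" + ", ".join(names) + "]"


def render_typed_dict(name: str, fields: Dict[str, str]) -> str:
    """Render a TypedDict, using class syntax only if every key allows it."""
    usable = all(k.isidentifier() and not keyword.iskeyword(k) for k in fields)
    if usable:
        body = [f"    {key}: {type_expr}" for key, type_expr in fields.items()]
        return "\n".join([f"class {name}(TypedDict):"] + (body or ["    pass"]))
    # keys like "+1" cannot be written as class attributes,
    # so fall back to the functional TypedDict form
    items = ", ".join(f"{key!r}: {type_expr}" for key, type_expr in fields.items())
    return f"{name} = TypedDict({name!r}, {{{items}}})"


print(infer_field_type([False, False, False]))
print(render_typed_dict("ReleaseReactions", {"url": "str", "+1": "int"}))
```

the first call shows where the `Literal[False]` false positive comes from: if every observed value is `False`, a learner like this has no evidence that `True` is possible. feeding it more (or more varied) examples is the only way for it to widen the type on its own.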
i use this library whenever i need to work with new JSON data for longer than 5 minutes, and my life has improved dramatically. oh, and the name is a reference to this cool song.