"git for data" FOSS tools

link source: https://news.ycombinator.com/item?id=26370863


Created and maintained by Joseph Jacks (OSS Capital) and Justin Cormack (Docker), starting in early December 2020.

Criteria: This sheet lists FOSS technologies (and the COSS companies behind them, where they exist) that focus on building "git for data" solutions in various domains.

| COSS Company | Known Funding ($M) | FOSS Technology | Stars | Launched | Status | Focus | Description | Lead Creator/PM |
|---|---|---|---|---|---|---|---|---|
| - | - | https://github.com/datahuborg/datahub | ~200 | September 2013 | Dormant | Data collaboration | Data collaboration platform | https://twitter.com/anantpb |
| AERGO | $30 | https://github.com/aergoio/litetree | ~1,400 | August 2018 | Dormant | SQLite | Branch and merge SQLite | https://twitter.com/aergo_io |
| Attic Labs | $8 | https://github.com/attic-labs/noms | ~7,300 | May 2018 | Dormant | Database | Declarative content-addressed database | https://twitter.com/aboodman |
| DoltHub | $5 | https://github.com/dolthub/dolt | ~2,000 | December 2018 | Active | SQL | Version SQL tables, merge, branch. Hosted hub for public data. | https://twitter.com/timsehn |
| Dotscience | $10 | https://github.com/dotmesh-oss/dotmesh | ~500 | February 2018 | Dormant | ML | Originally general-purpose versioned data, pivoted to replicable experiments | https://twitter.com/lmarsden |
| GitLab | $434 | https://gitlab.com/meltano/meltano | ~400 | July 2018 | Active | ETL | Orchestration of ELT pipelines | https://twitter.com/DouweM |
| Gretel.ai | $16 | https://github.com/gretelai/gretel-synthetics | ~100 | March 2020 | Active | Data generation | Synthetic, privacy-preserving data generation | https://twitter.com/AlexWatson405 |
| Grist Labs | - | https://github.com/paulfitz/daff | ~550 | January 2013 | Dormant | Diff | Data diff tool | https://twitter.com/fitzyfitzyfitzy |
| Grist Labs | - | https://github.com/gristlabs/grist-core | ~20 | May 2020 | Active | Spreadsheet | Versioned spreadsheet | https://twitter.com/fitzyfitzyfitzy |
| Iterative | $4 | https://github.com/iterative/dvc | ~6,800 | March 2017 | Active | ML | Git/Git LFS and Makefiles for ML and data science | https://twitter.com/rkuprieiev |
| Pachyderm | $28 | https://github.com/pachyderm/pachyderm | ~4,700 | October 2014 | Active | Data science | Version-controlled data ingestion and processing pipeline | https://twitter.com/jdoliner |
| Qri | - | https://github.com/qri-io/qri | ~1,000 | October 2016 | Active | Data management | Dataset version control | https://twitter.com/b_fiive |
| Quilt Data | $4 | https://github.com/quiltdata/quilt | ~1,000 | February 2017 | Active | ML/data | Versioning for small and large data that doesn't fit in git, e.g. ML models. S3/AWS based. | https://twitter.com/akarve |
| Replicate | - | https://github.com/replicate/replicate | ~500 | August 2020 | Active | ML | Version ML models; focus on simpler workflows and introducing people to version control | https://twitter.com/bfirsh |
| Tarides | - | https://github.com/mirage/irmin | ~1,400 | August 2017 | Active | Blockchain/general | Git for merging distributed data models. OCaml. Used by Tezos. | https://twitter.com/eriangazag |
| TerminusDB | $1 | https://github.com/terminusdb/terminusdb | ~1,000 | May 2019 | Active | Database | Revision-controlled graph database | https://twitter.com/GavinMGleason |
| Treeverse | - | https://github.com/treeverse/lakeFS | ~500 | September 2019 | Active | Data management | Versioned data lake for ETL and data science | https://twitter.com/lakeFS |
| XetHub | $7.5 | ? | ? | ? | Active | Data management | | |

Info
Tags Software
Type Google Sheet
Published 08/05/2025, 09:07:46
