Created and maintained by Joseph Jacks (OSS Capital) and Justin Cormack (Docker) in early December 2020
Criteria: This sheet lists FOSS technologies (and the COSS companies behind them, if any) that focus on creating "git for data" solutions in various domains.
| COSS Company | Known Funding ($M) | FOSS Technology | Stars | Launched | Status | Focus | Description | Lead Creator/PM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| - | - | https://github.com/datahuborg/datahub | 200~ | September 2013 | Dormant | Data collaboration | Data collaboration platform | https://twitter.com/anantpb |
| AERGO | $30 | https://github.com/aergoio/litetree | 1,400~ | August 2018 | Dormant | SQLite | Branch and merge SQLite | https://twitter.com/aergo_io |
| Attic Labs | $8 | https://github.com/attic-labs/noms | 7,300~ | May 2018 | Dormant | Database | Declarative, content-addressed database | https://twitter.com/aboodman |
| DoltHub | $5 | https://github.com/dolthub/dolt | 2,000~ | December 2018 | Active | SQL | Versioned SQL tables with merge and branch. Hosted hub for public data. | https://twitter.com/timsehn |
| Dotscience | $10 | https://github.com/dotmesh-oss/dotmesh | 500~ | February 2018 | Dormant | ML | Originally general-purpose versioned data, pivoted to replicable experiments | https://twitter.com/lmarsden |
| GitLab | $434 | https://gitlab.com/meltano/meltano | 400~ | July 2018 | Active | ETL | Orchestration of ELT pipelines | https://twitter.com/DouweM |
| Gretel.ai | $16 | https://github.com/gretelai/gretel-synthetics | 100~ | March 2020 | Active | Data generation | Privacy-preserving synthetic data generation | https://twitter.com/AlexWatson405 |
| Grist Labs | - | https://github.com/paulfitz/daff | 550~ | January 2013 | Dormant | Diff | Data diff tool | https://twitter.com/fitzyfitzyfitzy |
| Grist Labs | - | https://github.com/gristlabs/grist-core | 20~ | May 2020 | Active | Spreadsheet | Versioned spreadsheet | https://twitter.com/fitzyfitzyfitzy |
| Iterative | $4 | https://github.com/iterative/dvc | 6,800~ | March 2017 | Active | ML | Git/Git LFS and Makefiles for ML and data science | https://twitter.com/rkuprieiev |
| Pachyderm | $28 | https://github.com/pachyderm/pachyderm | 4,700~ | October 2014 | Active | Data science | Version-controlled data ingestion and processing pipeline | https://twitter.com/jdoliner |
| Qri | - | https://github.com/qri-io/qri | 1,000~ | October 2016 | Active | Data management | Dataset version control | https://twitter.com/b_fiive |
| Quilt Data | $4 | https://github.com/quiltdata/quilt | 1,000~ | February 2017 | Active | ML/data | Versioning for small and large data that doesn't fit in Git, e.g. ML models. S3/AWS-based. | https://twitter.com/akarve |
| Replicate | - | https://github.com/replicate/replicate | 500~ | August 2020 | Active | ML | Versions ML models; focuses on simpler workflows and introducing people to version control | https://twitter.com/bfirsh |
| Tarides | - | https://github.com/mirage/irmin | 1,400~ | August 2017 | Active | Blockchain/general | Git for merging distributed data models. Written in OCaml. Used by Tezos. | https://twitter.com/eriangazag |
| TerminusDB | $1 | https://github.com/terminusdb/terminusdb | 1,000~ | May 2019 | Active | Database | Revision-controlled graph database | https://twitter.com/GavinMGleason |
| Treeverse | - | https://github.com/treeverse/lakeFS | 500~ | September 2019 | Active | Data management | Versioned data lake for ETL and data science | https://twitter.com/lakeFS |
| XetHub | $7.5 | ? | ? | ? | Active | Data management | | |