Data-Juicer: The Data Operating System for the Foundation Model Era (Tool)
A Ray-native framework delivers 200+ composable data-curation operators for cleaning, deduplicating, synthesizing, and analyzing text, image, audio, video, and multimodal training data. Teams define reusable pipelines as YAML recipes, combine operators into custom workflows, and scale execution from local machines to large distributed clusters.
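To make the recipe idea concrete, here is a minimal sketch of how a declarative recipe can drive a chain of composable operators. It is not Data-Juicer's API: the operator names (`whitespace_normalizer`, `min_length_filter`, `exact_deduplicator`), the `OPERATORS` registry, and `run_recipe` are all hypothetical, and the recipe is written as a Python list that mirrors what a YAML file might parse into.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Operator = Callable[..., List[Sample]]


def whitespace_normalizer(samples: List[Sample]) -> List[Sample]:
    """Collapse runs of whitespace in each sample's text (hypothetical op)."""
    return [{**s, "text": " ".join(s["text"].split())} for s in samples]


def min_length_filter(samples: List[Sample], min_len: int = 1) -> List[Sample]:
    """Keep samples whose text has at least `min_len` characters (hypothetical op)."""
    return [s for s in samples if len(s["text"]) >= min_len]


def exact_deduplicator(samples: List[Sample]) -> List[Sample]:
    """Drop samples whose text exactly matches an earlier one (hypothetical op)."""
    seen = set()
    kept = []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept


# Hypothetical registry mapping recipe step names to operator functions.
OPERATORS: Dict[str, Operator] = {
    "whitespace_normalizer": whitespace_normalizer,
    "min_length_filter": min_length_filter,
    "exact_deduplicator": exact_deduplicator,
}

# The recipe: an ordered list of {operator_name: parameters} steps,
# shaped like the structure a YAML recipe could be loaded into.
RECIPE = [
    {"whitespace_normalizer": {}},
    {"min_length_filter": {"min_len": 5}},
    {"exact_deduplicator": {}},
]


def run_recipe(recipe: List[Dict[str, dict]], samples: List[Sample]) -> List[Sample]:
    """Apply each recipe step in order, feeding one operator's output to the next."""
    for step in recipe:
        for name, params in step.items():
            samples = OPERATORS[name](samples, **params)
    return samples


if __name__ == "__main__":
    data = [
        {"text": "hello   world"},
        {"text": "hello world"},
        {"text": "hi"},
    ]
    print(run_recipe(RECIPE, data))
```

Keeping the recipe as plain data is what makes it reusable: the same list can be stored, versioned, and reordered without touching operator code, and a distributed runner could apply the same steps to partitions of a larger dataset.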