List of Apache Spark MLlib Customers

Apache Software Foundation

1000 N West Street, Suite 1200,
Wilmington, 19801, DE,
United States

https://www.apache.org/

1 302-250-4080

Non Profit

$2M

Since 2010, our global team of researchers has been studying Apache Spark MLlib customers around the world, aggregating large volumes of data points that form the basis of our forecast assumptions and help track the rise and fall of certain vendors and their products on a quarterly basis.

Each quarter our research team identifies, from public sources (press releases, customer references, testimonials, case studies and success stories) and proprietary sources, companies that have purchased Apache Spark MLlib for ML and Data Science Platforms. For each company we record the customer size, industry, location, implementation status, partner involvement, line-of-business (LOB) key stakeholders and contact details for related IT decision-makers.

Companies using Apache Spark MLlib for ML and Data Science Platforms include: Salesforce, a United States based Professional Services organisation with 83334 employees and revenues of $41.53 billion; CrowdStrike, a United States based Professional Services organisation with 10363 employees and revenues of $3.95 billion; Yelp, a United States based Professional Services organisation with 5116 employees and revenues of $1.41 billion; FINRA, a United States based Professional Services organisation with 3600 employees and revenues of $1.11 billion; GumGum, a United States based Professional Services organisation with 480 employees and revenues of $113.0 million; and many others.

Contact us if you need a complete and verified list of companies using Apache Spark MLlib, including the breakdown by industry (21 Verticals), Geography (Region, Country, State, City), Company Size (Revenue, Employees, Asset) and the related IT Decision Makers, Key Stakeholders, and business and technology executives responsible for the software purchases.

The Apache Spark MLlib customer wins are being incorporated into our Enterprise Applications Buyer Insight and Technographics Customer Database, which has over 100 data fields that detail company usage of software systems and their digital transformation initiatives. Apps Run The World wants to become your No. 1 technographic data source!

Apply Filters For Customers


Customer	Industry	Empl.	Revenue	Country	Vendor	Application	Category	When	SI	Insight
CrowdStrike	Professional Services	10363	$4.0B	United States	Apache Software	Apache Spark MLlib	ML and Data Science Platforms	2016	n/a	In 2016, CrowdStrike implemented Apache Spark MLlib to perform large-scale feature extraction and to drive machine learning classification of event data ingested from Falcon Host, its software-as-a-service endpoint protection solution. The deployment focused on embedding Apache Spark MLlib into the data processing pipeline used by the security research and engineering teams to support behavioral analysis and model training workflows. The implementation concentrated on Spark-based feature engineering and model scoring capabilities, using Apache Spark MLlib for scalable distributed machine learning workloads. CrowdStrike configured Spark jobs for batch feature extraction and iterative model development, and instrumented job lifecycle controls to align compute usage with the engineering team’s need for agility. Operational integration included coupling Apache Spark MLlib with CrowdStrike’s Apache Cassandra-backed Threat Graph and running the analytics stack on AWS infrastructure to reduce operational overhead. The architecture emphasized ephemeral instance control for Cassandra, the ability to start and stop nodes for environment rebuilds, and scalable compute provisioning for Spark to address rapidly growing event volumes. Governance and operational requirements centered on high availability, scalability, and cost-effective storage for petabyte-scale Cassandra data. Rollout priorities included maintaining uptime for Falcon Host ingestion pipelines, enabling reproducible environment rebuilds for engineering, and ensuring Spark MLlib workflows could scale without increasing on-premises operational burden.
FINRA	Professional Services	3600	$1.1B	United States	Apache Software	Apache Spark MLlib	ML and Data Science Platforms	2019	n/a	In 2019, FINRA deployed Apache Spark MLlib on Amazon EMR to move from SQL batch processes on-prem to cloud native distributed analytics for billions of time-ordered market events. The work was implemented within the ML and Data Science Platforms category to provide scalable machine learning infrastructure for surveillance and analytics use cases. Configuration emphasized Apache Spark MLlib based model training and machine learning pipeline orchestration, enabling feature engineering, iterative model development, and large scale distributed computation. Workloads were restructured from nightly SQL batch jobs to continuous Spark workflows to support faster training cycles and backtesting on historic market downturn datasets. Operationally the deployment used Amazon EMR for compute elasticity to process high velocity market event streams and historical order tapes at scale. These compute and ML workflows were consumed by data science teams supporting market surveillance, risk analytics, investor protection, and market integrity functions. Governance shifted from batch release cycles to pipeline and model governance with standardized validation and backtesting workflows to ensure model integrity for surveillance and compliance. FINRA can now test models on realistic data from market downturns, enhancing its ability to provide investor protection and promote market integrity.
GumGum	Professional Services	480	$113M	United States	Apache Software	Apache Spark MLlib	ML and Data Science Platforms	2017	n/a	In 2017, GumGum implemented Apache Spark MLlib to operationalize machine learning across its advertising analytics stack and to handle extremely high event volumes. The implementation targeted a platform that ingests more than 1 billion events per day, approximately 6 TB of data daily, and was selected to support continuous processing and model-driven inventory forecasting, addressing the company's need to expedite customer decision making and scale quickly. Apache Spark MLlib was deployed on Amazon EMR as the primary machine learning runtime, with configurations for model training, batch scoring, and feature engineering pipelines. The deployment uses Apache Spark MLlib for inventory forecasting workflows and integrates standard Spark MLlib capabilities for model fitting, transformation pipelines, and distributed feature processing to support programmatic and native advertising analytics. The architecture places ad servers at the event edge, writing event logs that are uploaded to Amazon Simple Storage Service (S3) on an hourly cadence. AWS Data Pipeline orchestrates production, testing, and development workflows, Amazon EMR runs Apache Spark MLlib workloads alongside Hadoop for hourly data processing, and processed outputs are persisted into Amazon Redshift for downstream analytics and reporting. Operational coverage includes production, testing, and development environments and impacts ad operations and analytics functions responsible for campaign forecasting and reporting. Governance and operationalization relied on pipeline-driven environment segregation and hourly ingestion patterns to remove processing bottlenecks and maintain continuous processing requirements. The implementation of Apache Spark MLlib at GumGum is positioned as a scalable, EMR-hosted machine learning layer within the larger AWS-based data pipeline, designed to support programmatic advertising, image-recognition-derived signals, and customer-facing analytics.
Salesforce	Professional Services	83334	$41.5B	United States	Apache Software	Apache Spark MLlib	ML and Data Science Platforms	2020	n/a	In 2020, Salesforce deployed Apache Spark MLlib inside Salesforce Einstein to perform distributed, parallel computations on multi-terabyte datasets, using the application as part of its ML and Data Science Platforms stack. The implementation positioned Apache Spark MLlib as the core distributed processing engine to scale model training and large-scale feature engineering beyond single-machine CPU limits, enabling horizontal parallelism across commodity clusters. The implementation emphasized Spark partitioning and configuration as primary functional capabilities, using DataFrame and RDD partition strategies, distributed MLlib algorithm pipelines, and tuned executor and core allocations to control task parallelism. Engineers configured partition sizing and shuffle behavior, and embedded iterative MLlib training workflows and pipeline stages to optimize data locality and reduce network I/O overhead, reflecting category-aligned practices for ML and Data Science Platforms. Operationally the deployment was integrated into Einstein data engineering and model training pipelines, supporting batch scoring and iterative model development across Salesforce engineering and data science teams. The scope focused on platform-level orchestration of distributed jobs rather than single-node optimization, with Spark cluster resource management and partition-aware job submission forming the primary operational surface. Governance and process changes centered on engineering ownership of partitioning policies, test-driven tuning of partition counts, and monitoring of task skew to prevent imbalanced parallelism. The team documented that relying on Spark defaults can introduce significant inefficiencies, with an observed performance gap of up to 70 percent when applications were not tuned for partitioning, highlighting the need for explicit partition strategy as part of platform governance.
Yelp	Professional Services	5116	$1.4B	United States	Apache Software	Apache Spark MLlib	ML and Data Science Platforms	2018	n/a	In 2018, Yelp implemented Apache Spark MLlib as the core engine within its ML and Data Science Platforms to support Ad Targeting, training models from billions of ad impressions and scaling gradient boosted trees with more than three million nodes. Yelp used Apache Spark MLlib to operationalize many stages of its large-scale machine learning pipeline for advertising, embedding the application directly in model training, feature preparation, visualization and diagnostics workflows. The implementation emphasized large-scale feature engineering and model training capabilities, with Apache Spark MLlib configured to execute distributed feature transformations, iterative algorithm training and model evaluation at scale. Functional capabilities explicitly included feature engineering, visualizations for model diagnostics, supervised learning training flows such as gradient boosted tree training, and model evaluation and diagnostics to validate candidate models prior to production rollout. Architecture and operational coverage focused on distributed Spark cluster execution across Yelp data science and ML engineering teams, integrating the ML and Data Science Platforms into advertising model pipelines to support Ad Targeting business functions. The deployment covered multi-stage batch pipelines for feature extraction and model training, and analytics-oriented visualization stages for diagnostics and evaluation, reflecting a production-oriented architecture for large-scale advertising workloads. Governance and rollout centered on model evaluation and diagnostics as primary controls, and the project documented challenges in building and deploying a large-scale intelligent system in production, which shaped operational practices for monitoring, evaluation and iterative model improvement. The narrative for Yelp highlights Apache Spark MLlib as the deployed application, within the ML and Data Science Platforms category, applied specifically to Ad Targeting and associated data science and engineering processes.

Showing 1 to 5 of 5 entries
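
The employee-size breakdown quoted in the FAQ (0%, 20%, 40%, 40%) can be reproduced from the five rows listed above. The following is a minimal Python sketch, not the method used to build the database: the employee counts are copied from the table, while the function names and the inclusive band boundaries are assumptions made for illustration.

```python
from collections import Counter

# Employee counts copied from the customer table above.
EMPLOYEES = {
    "Salesforce": 83334,
    "CrowdStrike": 10363,
    "Yelp": 5116,
    "FINRA": 3600,
    "GumGum": 480,
}

# Size bands as worded in the FAQ. Treating both bounds as inclusive
# is an assumption; the source does not say how edge values are handled.
BANDS = [
    ("0-100", 0, 100),
    ("101-1,000", 101, 1_000),
    ("1,001-10,000", 1_001, 10_000),
    ("10,000+", 10_001, None),
]


def band_for(employees):
    """Return the label of the size band that contains `employees`."""
    for label, low, high in BANDS:
        if employees >= low and (high is None or employees <= high):
            return label
    raise ValueError(f"no size band for {employees!r}")


def size_breakdown(employees_by_company):
    """Return the percentage of companies per size band, rounded to whole numbers."""
    counts = Counter(band_for(n) for n in employees_by_company.values())
    total = len(employees_by_company)
    return {label: round(100 * counts[label] / total) for label, _, _ in BANDS}


breakdown = size_breakdown(EMPLOYEES)
print(breakdown)
```

With only five listed customers, each company accounts for 20 percentage points, so the shares describe this sample rather than the wider user base.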

Buyer Intent: Companies Evaluating Apache Spark MLlib

ARTW Buyer Intent uncovers actionable customer signals, identifying software buyers actively evaluating Apache Spark MLlib. Gain ongoing access to real-time prospects and uncover hidden opportunities.

Discover Software Buyers actively Evaluating Enterprise Applications


Logo	Company	Industry	Employees	Revenue	Country	Evaluated
No data found

FAQ - APPS RUN THE WORLD Apache Spark MLlib Coverage

Q1. What is Apache Spark MLlib used for?

Apache Spark MLlib is an ML and Data Science Platforms solution from Apache Software.

Q2. Who uses Apache Spark MLlib for ML and Data Science Platforms?

Companies worldwide use Apache Spark MLlib, from small firms to large enterprises across 21+ industries.

Q3. Which companies use Apache Spark MLlib?

Organizations such as Salesforce, CrowdStrike, Yelp, FINRA and GumGum are recorded users of Apache Spark MLlib for ML and Data Science Platforms.

Q4. What is the industry breakdown of companies using Apache Spark MLlib?

Companies using Apache Spark MLlib are most concentrated in Professional Services, with adoption spanning over 21 industries.

Q5. What is the country breakdown of companies using Apache Spark MLlib?

Companies using Apache Spark MLlib are most concentrated in the United States, with adoption tracked across 195 countries worldwide. This global distribution highlights the popularity of Apache Spark MLlib across the Americas, EMEA, and APAC.

Q6. What is the breakdown by employee size of companies using Apache Spark MLlib?

Companies using Apache Spark MLlib range from small businesses with 0-100 employees (0%), to mid-sized firms with 101-1,000 employees (20%), large organizations with 1,001-10,000 employees (40%), and global enterprises with 10,000+ employees (40%).

Q7. What is the breakdown by revenue of companies using Apache Spark MLlib?

Customers of Apache Spark MLlib include firms across all revenue levels, from $0-100M to $101M-$1B, $1B-$10B, and $10B+ global corporations.

Q8. How can I get the full list of companies using Apache Spark MLlib?

Contact APPS RUN THE WORLD to access the full verified Apache Spark MLlib customer database with detailed firmographics such as industry, geography, revenue, and employee breakdowns, as well as the key decision makers in charge of ML and Data Science Platforms.