Unified engine for large-scale data analytics
=============================================
[Get Started](/docs/latest/quick-start.html)
What is Apache Spark?
---------------------
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Simple.
Fast.
Scalable.
Unified.
Key features
Batch/streaming data
Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java, or R (a streaming sketch follows this feature list).
SQL analytics
Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Spark runs faster than most data warehouses.
Data science at scale
Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.
Machine learning
Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.
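Returning to the batch/streaming point above: here is a minimal PySpark sketch of the unified API, in which the same transformation runs over a static directory and over files as they arrive. The `events/` path and the `action` column are hypothetical, not part of the examples on this page.
from pyspark.sql import SparkSession

# In the pyspark shell, `spark` already exists; in a standalone script, create it:
spark = SparkSession.builder.appName("unified-batch-streaming").getOrCreate()

# Batch: count events per action from a static directory of JSON files
batch_df = spark.read.json("events/")      # hypothetical input path
batch_df.groupBy("action").count().show()  # hypothetical 'action' column

# Streaming: the identical transformation over files as they arrive
stream_df = spark.readStream.schema(batch_df.schema).json("events/")
query = (stream_df.groupBy("action").count()
         .writeStream.outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()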
Run now
Install with `pip`
$ pip install pyspark
$ pyspark
Use the official Docker image
$ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
from pyspark.ml.regression import RandomForestRegressor

# Every record contains a label and feature vector
# ('data' is assumed to be a list of (label, features) rows prepared earlier)
df = spark.createDataFrame(data, ["label", "features"])
# Split the data into train/test datasets
train_df, test_df = df.randomSplit([.80, .20], seed=42)
# Set hyperparameters for the algorithm
rf = RandomForestRegressor(numTrees=100)
# Fit the model to the training data
model = rf.fit(train_df)
# Generate predictions on the test dataset.
model.transform(test_df).show()
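To extend the training walkthrough above, a hedged sketch of scoring the held-out split with RegressionEvaluator; the RMSE metric is an illustrative choice, not part of the original example.
from pyspark.ml.evaluation import RegressionEvaluator

# Score the held-out test split and compute RMSE
predictions = model.transform(test_df)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
print("RMSE:", evaluator.evaluate(predictions))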
# Exploratory data analysis example: load account data from CSV
df = spark.read.csv("accounts.csv", header=True)
# Select subset of features and filter for balance > 0
filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")
# Generate summary statistics
filtered_df.summary().show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-sql
spark-sql>
SELECT
name.first AS first_name,
name.last AS last_name,
age
FROM json.`logs.json`
WHERE age > 21;
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
val df = spark.read.json("logs.json")
df.where("age > 21")
.select("name.first").show()
Run now
$ docker run -it --rm spark /opt/spark/bin/spark-shell
scala>
Dataset<Row> df = spark.read().json("logs.json");
df.where("age > 21")
.select("name.first").show();
Run now
$ docker run -it --rm spark:r /opt/spark/bin/sparkR
>
df <- read.json(path = "logs.json")
df <- filter(df, df$age > 21)
head(select(df, df$name.first))
The most widely-used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark™.
Over 2,000 contributors to the open source project from industry and academia.
Ecosystem
Apache Spark™ integrates with your favorite frameworks, helping to scale them to thousands of machines.
Data science and Machine learning
SQL analytics and BI
Storage and Infrastructure
Spark SQL engine: under the hood
Apache Spark™ is built on an advanced distributed SQL engine for large-scale data.
[Adaptive Query Execution](/docs/latest/sql-performance-tuning.html#adaptive-query-execution)
Spark SQL adapts the execution plan at runtime, for example by automatically choosing the number of reducers and the join algorithm.
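As a minimal sketch (assuming a live `spark` session), AQE is controlled through configuration; it is enabled by default in recent Spark releases, so these settings simply make that explicit.
# AQE is on by default in recent Spark releases; these settings make it explicit
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Allow Spark to coalesce small shuffle partitions at runtime
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")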
[Support for ANSI SQL](/docs/latest/sql-ref-ansi-compliance.html)
Use the same SQL you're already comfortable with.
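A small sketch of what ANSI mode changes in practice (assuming a live `spark` session): with `spark.sql.ansi.enabled` set, invalid operations fail loudly instead of silently producing NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
# With ANSI mode on, an invalid cast raises an error instead of returning NULL
spark.sql("SELECT CAST('not a number' AS INT)").show()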
[Structured and unstructured data](/docs/latest/sql-data-sources-json.html)
Spark SQL works on structured tables and unstructured data such as JSON or images.
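For instance, a brief sketch using the same logs.json file from the examples above: Spark infers a schema for semi-structured JSON, including nested fields, which can then be queried with dotted paths.
# Spark infers the schema of semi-structured JSON, including nested fields
df = spark.read.json("logs.json")
df.printSchema()
# Nested fields can be queried with dotted paths
df.select("name.first").show()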
Chart: TPC-DS 1TB (no statistics), with vs. without Adaptive Query Execution; AQE accelerates TPC-DS queries by up to 8x.
Join the community
Spark has a thriving open source community, with contributors from around the globe building features, improving documentation, and assisting other users.
[Mailing list](/community.html)
[Source code](https://github.com/apache/spark)
[News and events](/news/)
[How to contribute](/contributing.html)
[Issue tracking](https://issues.apache.org/jira/projects/SPARK/issues)
[Committers](/committers.html)