šŸ³ļøApache Sparkā„¢ - Unified Engine for large-scale data analytics

Website faviconspark.apache.org

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Unified engine for large-scale data analytics
=============================================

[Get Started](/docs/latest/quick-start.html)

What is Apache Sparkā„¢?
----------------------

Apache Sparkā„¢ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Simple. Fast. Scalable. Unified.

Key features
------------

Batch/streaming data

Unify the processing of your data in batches and real-time streaming, using your preferred language: Python, SQL, Scala, Java or R.
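
As a minimal sketch of that unification (assuming a hypothetical `events/` directory of JSON files with the same shape as the `logs.json` examples below), the same DataFrame operations apply to a static read and to a stream over the same path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

    # Batch: query a static directory of JSON files
    batch_df = spark.read.json("events/")
    batch_df.where("age > 21").select("name.first").show()

    # Streaming: the same query over files arriving in that directory
    # (streaming file sources require an explicit schema)
    stream_df = spark.readStream.schema(batch_df.schema).json("events/")
    query = (stream_df.where("age > 21").select("name.first")
             .writeStream.format("console").outputMode("append").start())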

SQL analytics

Execute fast, distributed ANSI SQL queries for dashboarding and ad-hoc reporting. Runs faster than most data warehouses.
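
For instance (a sketch, assuming a hypothetical `sales.csv` with `region` and `amount` columns), ad-hoc SQL can be run directly against a DataFrame registered as a temporary view:

    df = spark.read.csv("sales.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("sales")

    spark.sql("""
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        ORDER BY total DESC
    """).show()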

Data science at scale

Perform Exploratory Data Analysis (EDA) on petabyte-scale data without having to resort to downsampling.

Machine learning

Train machine learning algorithms on a laptop and use the same code to scale to fault-tolerant clusters of thousands of machines.

Run now
-------

Python

Install with `pip`:

    $ pip install pyspark
    $ pyspark

Or use the official Docker image:

    $ docker run -it --rm spark:python3 /opt/spark/bin/pyspark

Run a first query against semi-structured JSON:

    df = spark.read.json("logs.json")
    df.where("age > 21").select("name.first").show()

Machine learning with MLlib (the imports and the sample `data` rows below are additions so the snippet runs end to end):

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.regression import RandomForestRegressor

    # Every record contains a label and feature vector
    # (sample rows added for illustration)
    data = [(1.0, Vectors.dense([0.0, 1.1, 0.1])),
            (0.0, Vectors.dense([2.0, 1.0, -1.0])),
            (0.0, Vectors.dense([2.0, 1.3, 1.0])),
            (1.0, Vectors.dense([0.0, 1.2, -0.5]))]
    df = spark.createDataFrame(data, ["label", "features"])

    # Split the data into train/test datasets
    train_df, test_df = df.randomSplit([.80, .20], seed=42)

    # Set hyperparameters for the algorithm
    rf = RandomForestRegressor(numTrees=100)

    # Fit the model to the training data
    model = rf.fit(train_df)

    # Generate predictions on the test dataset
    model.transform(test_df).show()

Exploratory data analysis over a CSV file:

    df = spark.read.csv("accounts.csv", header=True)

    # Select a subset of features and filter for balance > 0
    filtered_df = df.select("AccountBalance", "CountOfDependents").filter("AccountBalance > 0")

    # Generate summary statistics
    filtered_df.summary().show()

SQL

    $ docker run -it --rm spark /opt/spark/bin/spark-sql
    spark-sql>

    SELECT
      name.first AS first_name,
      name.last AS last_name,
      age
    FROM json.`logs.json`
    WHERE age > 21;

Scala

    $ docker run -it --rm spark /opt/spark/bin/spark-shell
    scala>

    val df = spark.read.json("logs.json")
    df.where("age > 21")
      .select("name.first").show()

Java

The same query, using the Java Dataset API (Java programs are compiled and submitted with `spark-submit` rather than typed into a shell):

    Dataset<Row> df = spark.read().json("logs.json");
    df.where("age > 21")
      .select("name.first").show();

R

    $ docker run -it --rm spark:r /opt/spark/bin/sparkR
    >

    df <- read.json(path = "logs.json")
    df <- filter(df, df$age > 21)
    head(select(df, df$name.first))

The most widely-used engine for scalable computing
--------------------------------------------------

Thousands of companies, including 80% of the Fortune 500, use Apache Sparkā„¢.  

Over 2,000 contributors to the open source project from industry and academia.

Ecosystem
---------

Apache Sparkā„¢ integrates with your favorite frameworks, helping to scale them to thousands of machines.

Data science and Machine learning

SQL analytics and BI

Storage and Infrastructure
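
As one example from the data science side of that ecosystem, the pandas API on Spark lets pandas-style code run distributed across a cluster (a sketch, reusing the `accounts.csv` file from the example above):

    import pyspark.pandas as ps

    # pandas-like syntax, executed by Spark across the cluster
    psdf = ps.read_csv("accounts.csv")
    psdf.groupby("CountOfDependents")["AccountBalance"].mean()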

Spark SQL engine: under the hood
--------------------------------

Apache Sparkā„¢ is built on an advanced distributed SQL engine for large-scale data.

[Adaptive Query Execution](/docs/latest/sql-performance-tuning.html#adaptive-query-execution)

Spark SQL adapts the execution plan at runtime, such as automatically setting the number of reducers and join algorithms.
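
AQE is enabled by default in recent Spark releases; the relevant configuration keys can also be set explicitly (a minimal sketch):

    # AQE re-optimizes the query plan using statistics observed at runtime
    spark.conf.set("spark.sql.adaptive.enabled", "true")

    # Coalesce shuffle partitions based on the actual sizes of the data
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")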

[Support for ANSI SQL](/docs/latest/sql-ref-ansi-compliance.html)

Use the same SQL you’re already comfortable with.
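
A small sketch of the difference ANSI mode makes: with `spark.sql.ansi.enabled` set, invalid operations fail fast with an error, matching standard SQL semantics, instead of silently returning NULL:

    spark.conf.set("spark.sql.ansi.enabled", "true")

    # Raises a runtime error under ANSI mode;
    # with ANSI mode off it would return NULL
    spark.sql("SELECT CAST('not a number' AS INT)").show()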

[Structured and unstructured data](/docs/latest/sql-data-sources-json.html)

Spark SQL works on structured tables and unstructured data such as JSON or images.
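
A sketch of that range of sources (the table name and paths here are hypothetical; the image data source is built into Spark):

    # Structured: a table registered in the catalog
    events = spark.table("examples.events")

    # Semi-structured: JSON with automatic schema inference
    logs = spark.read.json("logs.json")

    # Unstructured: Spark's built-in image data source
    images = spark.read.format("image").load("images/")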

Figure: TPC-DS 1TB (no statistics), with vs. without Adaptive Query Execution; AQE accelerates TPC-DS queries by up to 8x.

Join the community
------------------

Spark has a thriving open source community, with contributors from around the globe building features, writing documentation, and assisting other users.

- [Mailing list](/community.html)
- [Source code](https://github.com/apache/spark)
- [News and events](/news/)
- [How to contribute](/contributing.html)
- [Issue tracking](https://issues.apache.org/jira/projects/SPARK/issues)
- [Committers](/committers.html)