Apache Arrow

Software framework


Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It defines a standardized column-oriented memory format that can represent both flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[2][3][4][5][6] This reduces or eliminates factors that limit the feasibility of working with large data sets, such as the cost, volatility, or physical constraints of dynamic random-access memory.[7]
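The idea of a column-oriented layout for flat and hierarchical data can be illustrated with a minimal sketch. It uses only the Python standard library, not the Arrow libraries, and it simplifies the real format: the helper names `to_columns` and `tags_at`, the sample records, and the use of a plain list for validity flags (rather than a packed bitmap) are assumptions made for illustration.

```python
from array import array

# Row-oriented input: each record keeps all of its fields together.
rows = [
    {"id": 1, "price": 9.5, "tags": ["a", "b"]},
    {"id": 2, "price": None, "tags": []},
    {"id": 3, "price": 4.0, "tags": ["c"]},
]


def to_columns(records):
    """Hypothetical helper: split records into simplified column buffers."""
    # Flat column: one contiguous buffer of 64-bit integers.
    ids = array("q", (r["id"] for r in records))

    # Nullable column: a values buffer plus a validity list (1 = present).
    # A real columnar format would typically pack validity into a bitmap.
    price_valid = [0 if r["price"] is None else 1 for r in records]
    prices = array("d", (0.0 if r["price"] is None else r["price"] for r in records))

    # Hierarchical (list) column: all child values are flattened into one
    # sequence, and offsets mark where each record's list starts and ends.
    tag_values = []
    tag_offsets = array("i", [0])
    for r in records:
        tag_values.extend(r["tags"])
        tag_offsets.append(len(tag_values))

    return {
        "id": ids,
        "price": prices,
        "price_valid": price_valid,
        "tag_values": tag_values,
        "tag_offsets": tag_offsets,
    }


def tags_at(columns, i):
    """Hypothetical helper: recover record i's list from values and offsets."""
    start = columns["tag_offsets"][i]
    end = columns["tag_offsets"][i + 1]
    return columns["tag_values"][start:end]


columns = to_columns(rows)

# An aggregate over one column touches only that column's buffer.
valid_prices = [p for p, ok in zip(columns["price"], columns["price_valid"]) if ok]
print(sum(valid_prices))
print(tags_at(columns, 0))
```

Because each column lives in its own contiguous buffer, a scan or aggregate reads only the data it needs, which is the property that makes columnar layouts attractive for analytic workloads.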

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.[2]
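The notion of a zero-copy read can be sketched with Python's built-in `memoryview`, which lets several consumers look at the same buffer without duplicating it. This is only an analogy within a single process and does not use the Arrow libraries; the variable names and sample values are assumptions for illustration.

```python
import struct
from array import array

# A contiguous buffer of 64-bit floats, standing in for one column.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# A memoryview exposes the same memory without copying it.
view = memoryview(column)

# Slicing a memoryview is also zero-copy: the slice refers to the same bytes.
tail = view[2:]

# By contrast, serializing to a bytes object copies the data.
snapshot = view.tobytes()

# Writes through the original buffer are visible through every view,
# but not in the serialized copy.
column[0] = -1.0
column[3] = 40.0

print(view[0], tail[1])
print(struct.unpack("4d", snapshot))
```

The views observe the updated values because they share memory with `column`, while `snapshot` still holds the old contents. A shared, standardized layout aims for the first behaviour across languages and systems, avoiding the copy and conversion step that the second represents.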

Applications

Arrow has been used in diverse domains, including analytics,[8] genomics,[9][7] and cloud computing.[10]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[11] The engineering trade-offs in hardware resource use for in-memory processing differ from those associated with on-disk storage.[12] The Arrow and Parquet projects include libraries for reading and writing data between the two formats.[13]
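One way to picture the difference in trade-offs is to compare a compact encoded representation with a plain array. The sketch below uses run-length encoding as a stand-in for the kind of space-saving encoding an on-disk format might apply; it is an illustrative assumption, not a description of how Parquet, ORC, or Arrow are implemented, and the helpers `rle_encode` and `rle_get` are hypothetical.

```python
def rle_encode(values):
    """Hypothetical helper: collapse repeated values into [value, count] runs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_get(runs, index):
    """Random access into the encoded form has to walk the runs."""
    for value, count in runs:
        if index < count:
            return value
        index -= count
    raise IndexError("index out of range")


# Plain in-memory column: larger, but any element is reachable directly.
column = ["US"] * 5 + ["DE"] * 3 + ["FR"] * 2

# Encoded column: far fewer entries to store, but lookups need decoding work.
runs = rle_encode(column)

print(len(column), len(runs))
print(column[6], rle_get(runs, 6))
```

The encoded form favours storage size, while the plain array favours direct access during computation, which is the general shape of the storage-versus-processing trade-off.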

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[14] with development led by a coalition of developers from other open source data analytics projects.[15][16][6][17][18] The initial codebase and Java library were seeded by code from Apache Drill.[14]


References

  1. "Apache Arrow 13.0.0 (23 August 2023)". 23 August 2023. Retrieved 21 September 2023.
  2. Yegulalp, Serdar (27 February 2016). "Apache Arrow aims to speed access to big data". InfoWorld.
  3. Dinsmore T.W. (2016). "In-Memory Analytics: Satisfying the Need for Speed". Disruptive Analytics. Apress, Berkeley, CA. pp. 97–116. doi:10.1007/978-1-4842-1311-7_5. ISBN 978-1-4842-1312-4.
  4. Versaci F, Pireddu L, Zanetti G (2016). "Scalable genomics: from raw data to aligned reads on Apache YARN" (PDF). IEEE International Conference on Big Data: 1232–1241.
  5. Maas M, Asanović K, Kubiatowicz J (2017). "Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era". Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM): 138–143. doi:10.1145/3102980.3103003.
  6. "The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project". The Apache Software Foundation Blog. 17 February 2016. Archived from the original on 2016-03-13.
  7. Le Dem, Julien (28 November 2016). "The first release of Apache Arrow". SD Times.

This article uses material from the Wikipedia article Apache_Arrow, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.