Running Apache Flink on Alluxio

This guide describes how to get Alluxio running with Apache Flink, so that you can easily work with files stored in Alluxio.

Prerequisites

  • Java 8 Update 161 or higher (8u161+), 64-bit, has been installed.
  • Alluxio has been set up and is running.
  • Flink has been installed and set up.

Configuration

Apache Flink can access Alluxio through a generic file system wrapper for the Hadoop file system. Therefore, Alluxio is configured mostly in Hadoop configuration files.

Set property in core-site.xml

If you have a Hadoop setup next to the Flink installation, add the following property to the core-site.xml configuration file:

<property>
  <name>fs.alluxio.impl</name>
  <value>alluxio.hadoop.FileSystem</value>
</property>

If you do not have a Hadoop setup, create a file named core-site.xml with the following contents:

<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
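
For setups without Hadoop, creating the file can be scripted. The following is a minimal sketch, not an official Alluxio tool: the function name write_core_site and the target directory are hypothetical, and the function refuses to overwrite an existing core-site.xml so that an existing Hadoop configuration is never clobbered.

```shell
# Hypothetical helper: writes a minimal core-site.xml into the given directory.
# The directory is chosen by the caller; it is not an Alluxio or Flink default.
write_core_site() {
  conf_dir="$1"
  target="$conf_dir/core-site.xml"
  # Refuse to overwrite; add the property to an existing file by hand instead.
  if [ -e "$target" ]; then
    echo "refusing to overwrite existing $target" >&2
    return 1
  fi
  mkdir -p "$conf_dir" || return 1
  cat > "$target" <<'EOF'
<configuration>
  <property>
    <name>fs.alluxio.impl</name>
    <value>alluxio.hadoop.FileSystem</value>
  </property>
</configuration>
EOF
}

# Example call (the path is a placeholder):
#   write_core_site "$HOME/hadoop-conf"
```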

Next, specify the path to the Hadoop configuration in Flink. Open the conf/flink-conf.yaml file in the Flink root directory and set fs.hdfs.hadoopconf to the directory containing core-site.xml. (For newer Hadoop versions, this directory usually ends with etc/hadoop.)
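
As an illustration, the fs.hdfs.hadoopconf entry could be set from the shell as sketched below. The function name set_flink_hadoop_conf is hypothetical, and both arguments (the Flink root directory and the Hadoop configuration directory) are placeholders for your own paths. The sketch replaces a previous fs.hdfs.hadoopconf line if one exists and leaves all other settings untouched.

```shell
# Hypothetical helper: set fs.hdfs.hadoopconf in <flink_home>/conf/flink-conf.yaml.
#   $1 = Flink root directory, $2 = directory containing core-site.xml
set_flink_hadoop_conf() {
  flink_conf="$1/conf/flink-conf.yaml"
  hadoop_conf_dir="$2"
  if [ ! -f "$flink_conf" ]; then
    echo "not found: $flink_conf" >&2
    return 1
  fi
  # Keep every line except a previous fs.hdfs.hadoopconf entry.
  # grep exits non-zero when no line is kept (empty file), hence "|| true".
  grep -v '^fs\.hdfs\.hadoopconf:' "$flink_conf" > "$flink_conf.tmp" || true
  echo "fs.hdfs.hadoopconf: $hadoop_conf_dir" >> "$flink_conf.tmp"
  mv "$flink_conf.tmp" "$flink_conf"
}

# Example call (both paths are placeholders):
#   set_flink_hadoop_conf "$FLINK_HOME" "$HOME/hadoop-conf"
```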

Distribute the Alluxio Client Jar

To communicate with Alluxio, Flink programs need the Alluxio Core Client jar. We recommend downloading the tarball from the Alluxio download page. Alternatively, advanced users can compile the client jar from source by following the build instructions in the Alluxio documentation. The Alluxio client jar can be found at /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar.

We need to make the Alluxio jar file available to Flink, because it contains the configured alluxio.hadoop.FileSystem class.

There are different ways to achieve that:

  • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the lib directory of Flink (for local and standalone cluster setups).
  • Put the /<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar file into the ship directory for Flink on YARN.
  • Specify the location of the jar file in the HADOOP_CLASSPATH environment variable (make sure it is available on all cluster nodes as well). For example:
$ export HADOOP_CLASSPATH=/<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar
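
For the first option (local and standalone setups), the copy step might look like the sketch below. The function name install_alluxio_client is hypothetical, and the jar path and Flink root directory are supplied by the caller. The sketch only copies the jar into the lib directory of a single Flink installation; in a cluster, the same step would have to be repeated on every node.

```shell
# Hypothetical helper: copy the Alluxio client jar into Flink's lib directory.
#   $1 = path to the client jar, $2 = Flink root directory
install_alluxio_client() {
  jar="$1"
  flink_home="$2"
  if [ ! -f "$jar" ]; then
    echo "client jar not found: $jar" >&2
    return 1
  fi
  mkdir -p "$flink_home/lib" || return 1
  cp "$jar" "$flink_home/lib/"
}

# Example call (replace <PATH_TO_ALLUXIO> with your Alluxio directory):
#   install_alluxio_client "/<PATH_TO_ALLUXIO>/client/alluxio-2.9.3-client.jar" "$FLINK_HOME"
```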

In addition, if conf/alluxio-site.properties specifies any client-related properties, pass them to Flink as JVM options through env.java.opts in {FLINK_HOME}/conf/flink-conf.yaml so that Flink picks up the Alluxio configuration. For example, to configure the Alluxio client to use CACHE_THROUGH as the write type, add the following to {FLINK_HOME}/conf/flink-conf.yaml:

env.java.opts: -Dalluxio.user.file.writetype.default=CACHE_THROUGH
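
When several properties need translating, the env.java.opts line can be generated rather than typed by hand. The sketch below rests on two assumptions that may not hold for your setup: client-related properties are identified by the key prefix alluxio.user. (overridable through the second argument), and each property is written as key=value on one line, with no spaces around the equals sign and no spaces in the value. The function name alluxio_props_to_java_opts is hypothetical.

```shell
# Hypothetical helper: print an env.java.opts line built from a properties file.
#   $1 = path to alluxio-site.properties
#   $2 = key prefix treated as "client-related" (assumption: alluxio.user.)
alluxio_props_to_java_opts() {
  props_file="$1"
  prefix="${2:-alluxio.user.}"
  opts=""
  # Read line by line; the "|| [ -n ... ]" part keeps a last line without newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      # Only lines that start with the prefix and contain "=" are translated;
      # comments and other properties are skipped.
      "$prefix"*=*) opts="$opts -D$line" ;;
    esac
  done < "$props_file"
  echo "env.java.opts:$opts"
}

# Example call (the path is a placeholder):
#   alluxio_props_to_java_opts conf/alluxio-site.properties
```

For a properties file whose only matching entry is alluxio.user.file.writetype.default=CACHE_THROUGH, the function prints the same env.java.opts line as the example above.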

Note: If any Flink clusters are running, stop and restart them to apply the configuration changes.

To use Alluxio with Flink, just specify paths with the alluxio:// scheme.

If Alluxio is installed locally, a valid path looks like this: alluxio://localhost:19998/user/hduser/gutenberg.

Wordcount Example

This example assumes you have set up Alluxio and Flink as previously described.

Copy the LICENSE file into Alluxio, assuming you are in the top-level Alluxio project directory:

$ bin/alluxio fs copyFromLocal LICENSE alluxio://localhost:19998/LICENSE

Run the following command from the top-level Flink project directory:

$ bin/flink run examples/batch/WordCount.jar \
  --input alluxio://localhost:19998/LICENSE \
  --output alluxio://localhost:19998/output

Open your browser and check http://localhost:19999/browse. There should be an output file named output that contains the word counts of the LICENSE file.
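
The result can also be inspected from the command line instead of the browser. The following sketch uses the alluxio fs ls and alluxio fs cat commands and assumes it is run from the top-level Alluxio directory, as in the copyFromLocal step. The function name check_wordcount_output and the ALLUXIO_BIN override are hypothetical conveniences.

```shell
# ALLUXIO_BIN is a hypothetical override for installs with a different layout.
ALLUXIO_BIN="${ALLUXIO_BIN:-bin/alluxio}"

# Hypothetical helper: list the WordCount output path, then print its contents.
#   $1 = output path (defaults to the path used in the example above)
check_wordcount_output() {
  out_path="${1:-alluxio://localhost:19998/output}"
  # Fails here if the job did not write anything to the output path.
  "$ALLUXIO_BIN" fs ls "$out_path" || return 1
  # Prints the word counts. If the job wrote a directory of several files
  # instead of one file, call this function with one of the listed files.
  "$ALLUXIO_BIN" fs cat "$out_path"
}

# Example call:
#   check_wordcount_output
```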
