Monday, July 2, 2018

Reading and Writing Avro Files from the Command Line

Apache Avro is becoming one of the most popular data serialization formats, particularly on Hadoop-based big data platforms, because tools like Pig, Hive and of course Hadoop itself natively support reading and writing data in Avro format. Many users seem to enjoy Avro, but I have heard many complaints about not being able to conveniently read or write Avro files with command line tools: “Avro is nice, but why do I have to write Java or Python code just to quickly see what’s in a binary Avro file, or discover at least its Avro schema?”
To those users it comes as a surprise that Avro actually ships with exactly such command line tools, but apparently they are not prominently advertised or documented as such. In this short article I will show a few hands-on examples of how to read, write, compress and convert data from and to binary Avro using Avro Tools 1.7.4.


Here is an overview of what we want to do:

What we want to do

  • We will start with an example Avro schema and a corresponding data file in plain-text JSON format.
  • We will use Avro Tools to convert the JSON file into binary Avro, without and with compression (Snappy), and from binary Avro back to JSON.

Getting Avro Tools

You can get a copy of the latest stable Avro Tools jar file from the Avro Releases page. The actual file is in the java subdirectory of a given Avro release version. Here is a direct link to avro-tools-1.7.4.jar (11 MB) on the US Apache mirror site.
Save avro-tools-1.7.4.jar to a directory of your choice. I will use ~/avro-tools-1.7.4.jar for the examples shown below.

Tools included in Avro Tools

Just run Avro Tools without any parameters to see what’s included:
$ java -jar ~/avro-tools-1.7.4.jar
[...snip...]
Available tools:
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
      recodec  Alters the codec of a data file.
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, one record per line.
       totext  Converts an Avro data file to a text file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
Likewise, run any particular tool without parameters to see its usage/help output. For example, here is the help of the fromjson tool:
$ java -jar ~/avro-tools-1.7.4.jar fromjson
Expected 1 arg: input_file
Option                                  Description
------                                  -----------
--codec                                 Compression codec (default: null)
--schema                                Schema
--schema-file                           Schema File
Note that most of the tools write to STDOUT, so you will normally want to redirect their output to a file with the shell's > operator (particularly when working with large files).


Avro schema

The schema below defines a tuple of (username, tweet, timestamp) as the format of our example data records.
File: twitter.avsc:
{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [ {
    "name" : "username",
    "type" : "string",
    "doc"  : "Name of the user account on Twitter.com"
  }, {
    "name" : "tweet",
    "type" : "string",
    "doc"  : "The content of the user's Twitter message"
  }, {
    "name" : "timestamp",
    "type" : "long",
    "doc"  : "Unix epoch time in seconds"
  } ],
  "doc" : "A basic schema for storing Twitter messages"
}

Data records in JSON format

And here is some corresponding example data with two records that follow the schema defined in the previous section. We store this data in the file twitter.json.
Example data in twitter.json in JSON format:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended.  Terran is IMBA.","timestamp": 1366154481 }
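
Before converting a larger JSON file it can help to check that every line really matches the schema. The following is only a minimal sketch in plain Python, not a reimplementation of Avro's own validation: it assumes one JSON record per line, covers only the two primitive types used in twitter.avsc, and the names check_record and check_lines are made up for this example. It also flags unexpected fields, which is my own choice for a sanity check.

```python
import json

# Field names and types copied from twitter.avsc above ("doc" attributes omitted).
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "twitter_schema",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "tweet", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
""")

# Assumption: only the two primitive types used by this schema are covered.
PYTHON_TYPES = {"string": str, "long": int}


def check_record(record, schema=SCHEMA):
    """Return a list of problems found in one parsed JSON record."""
    problems = []
    expected = {field["name"]: field["type"] for field in schema["fields"]}
    for name, avro_type in expected.items():
        if name not in record:
            problems.append("missing field: " + name)
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[avro_type]):
            problems.append("field %s is not a %s" % (name, avro_type))
    for name in record:
        if name not in expected:
            problems.append("unexpected field: " + name)
    return problems


def check_lines(lines):
    """Check JSON records given one per line; yield (line_number, problems)."""
    for number, line in enumerate(lines, start=1):
        if line.strip():
            yield number, check_record(json.loads(line))
```

To run it against the example data, pass an open file: for number, problems in check_lines(open("twitter.json")) yields an empty problem list for each of the two records above.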

Converting to and from binary Avro

JSON to binary Avro

Without compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
With Snappy compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro
Note for Mac OS X users: If you run into SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] when trying to compress the data with Snappy, make sure you use JDK 6 and not JDK 7.
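
If you want to run the same conversion from a script instead of typing it by hand, the commands above can be wrapped in a few lines of Python. This is a rough sketch under some assumptions: the jar lives at ~/avro-tools-1.7.4.jar as in this article, java is on your PATH, and the helper names fromjson_command and run_to_file are invented for this example. Only the fromjson options shown in the help output above are used.

```python
import os
import subprocess

# Assumed location of the jar, as used throughout this article.
AVRO_TOOLS_JAR = os.path.expanduser("~/avro-tools-1.7.4.jar")


def fromjson_command(schema_file, json_file, codec=None, jar=AVRO_TOOLS_JAR):
    """Build the argument list for the fromjson tool."""
    command = ["java", "-jar", jar, "fromjson"]
    if codec is not None:
        command += ["--codec", codec]
    command += ["--schema-file", schema_file, json_file]
    return command


def run_to_file(command, output_path):
    """Run a command and write its STDOUT to output_path, like > in the shell."""
    with open(output_path, "wb") as out:
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(command, stdout=out, check=True)
```

For example, run_to_file(fromjson_command("twitter.avsc", "twitter.json", codec="snappy"), "twitter.snappy.avro") corresponds to the Snappy command above. Passing the arguments as a list avoids shell quoting problems with unusual file names.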

Binary Avro to JSON

The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.avro > twitter.json
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.snappy.avro > twitter.json
Note for Mac OS X users: If you run into SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] when trying to decompress the data with Snappy, make sure you use JDK 6 and not JDK 7.
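
To convince yourself that the round trip did not change the data, compare the parsed records rather than the raw text, because whitespace and key order in the tojson output may differ from your original file. Here is a small sketch in plain Python. The function names are made up for this example, and it assumes you kept the original input, for instance by redirecting the tojson output to a differently named file.

```python
import json


def load_records(lines):
    """Parse JSON records given one per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def same_records(original_lines, roundtrip_lines):
    """Return True if both inputs hold the same records in the same order."""
    return load_records(original_lines) == load_records(roundtrip_lines)
```

Both arguments can be open files or lists of strings, so same_records(open(original_path), open(roundtrip_path)) works for two files of your choice.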

Retrieve Avro schema from binary Avro

The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.avro > twitter.avsc
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.snappy.avro > twitter.avsc
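
The reason getschema works the same way on compressed and uncompressed files is that the schema sits in the file header, which is never compressed. As an illustration only, here is a sketch of how the header of an Avro data file could be read in plain Python. It follows my reading of the Avro object container file layout (the 4-byte magic "Obj" plus the byte 1, then a metadata map with keys such as avro.schema and avro.codec); the function names are invented for this example, and for real work you should use Avro Tools or an Avro library.

```python
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro data file


def read_long(buf, pos):
    """Decode one zigzag, variable-length Avro long; return (value, new_pos)."""
    shift = 0
    raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos  # undo the zigzag encoding


def read_bytes(buf, pos):
    """Decode a length-prefixed byte string; return (bytes, new_pos)."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length], pos + length


def read_header_metadata(buf):
    """Return the header metadata of an Avro data file as {str: bytes}."""
    if buf[:4] != MAGIC:
        raise ValueError("not an Avro data file")
    pos = 4
    meta = {}
    while True:
        count, pos = read_long(buf, pos)
        if count == 0:  # a block count of zero ends the map
            break
        if count < 0:  # negative count: a byte size follows, unused here
            count = -count
            _, pos = read_long(buf, pos)
        for _ in range(count):
            key, pos = read_bytes(buf, pos)
            value, pos = read_bytes(buf, pos)
            meta[key.decode("utf-8")] = value
    return meta


def get_schema(buf):
    """Return the embedded schema as a parsed JSON object."""
    return json.loads(read_header_metadata(buf)["avro.schema"].decode("utf-8"))
```

For the small example files you can simply pass the whole file content, for instance get_schema(open("twitter.snappy.avro", "rb").read()). The codec, if recorded, is available via read_header_metadata(...).get("avro.codec").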

Known Issues of Snappy with JDK 7 on Mac OS X

If you happen to use JDK 7 on Mac OS X 10.8 you will run into the error below when trying to run the Snappy related commands above. In that case make sure to explicitly use JDK 6. On Mac OS X 10.8 the JDK 6 java binary is by default available at /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java.
The cause of this problem is documented in the bug report “Native (Snappy) library loading fails on openjdk7u4 for mac”. This bug is already fixed in the latest Snappy-Java 1.0.5 milestone releases, but Avro 1.7.4 still depends on the latest stable release of Snappy-Java, which is 1.0.4.1 (see lang/java/pom.xml in the Avro source code).
I also found that one way to fix this problem when writing your own Java code is to explicitly require a Snappy-Java 1.0.5 milestone release. Here is the relevant dependency declaration for build.gradle in case you are using Gradle. This seems to solve the problem, but I have yet to confirm whether it is safe for production scenarios.
// Required to fix a Snappy native library error on OS X when trying to compress Avro files with Snappy;
// Avro 1.7.4 uses the latest stable release of Snappy, 1.0.4.1 (see avro/lang/java/pom.xml) that still contains
// the original bug described at https://github.com/xerial/snappy-java/issues/6.
//
// Note that in a production setting we do not care about OS X, so we could use Snappy 1.0.4.1 as required by
// Avro 1.7.4 as is.
//
compile group: 'org.xerial.snappy', name: 'snappy-java', version: '1.0.5-M4'
Detailed error message:
$ uname -a
Darwin mac.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan  6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64

$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)

$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro

java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
 at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
 at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
 at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
 at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
 at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
 at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
 at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
 at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
 at org.apache.avro.tool.Main.run(Main.java:80)
 at org.apache.avro.tool.Main.main(Main.java:69)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
 at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
 at java.lang.Runtime.loadLibrary0(Runtime.java:845)
 at java.lang.System.loadLibrary(System.java:1084)
 at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
 ... 15 more
Exception in thread "main" org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
 at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
 at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
 at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
 at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
 at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
 at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
 at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
 at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
 at org.apache.avro.tool.Main.run(Main.java:80)
 at org.apache.avro.tool.Main.main(Main.java:69)

Where to go from here

The example commands above show just a few variants of how to use Avro Tools to read, write and convert Avro files. The Avro Tools library has its own documentation, but I did not find those docs very helpful (the sources are, however).
I recommend simply running the command line tools without parameters and looking at the usage instructions they print. Normally this is enough to understand how they should be used.
