Apache Avro is becoming one of the most popular data serialization formats, and this holds true particularly for Hadoop-based big data platforms because tools like Pig, Hive and of course Hadoop itself natively support reading and writing data in Avro format. Many users seem to enjoy Avro, but I have repeatedly heard the complaint that it is not convenient to read or write Avro files with command line tools – “Avro is nice, but why do I have to write Java or Python code just to quickly see what’s in a binary Avro file, or at least discover its Avro schema?”
To those users it comes as a surprise that Avro actually ships with exactly such command line tools, but apparently they are not prominently advertised or documented as such. In this short article I will show a few hands-on examples of how to read, write, compress and convert data from and to binary Avro using Avro Tools 1.7.4.
What we want to do
- We will start with an example Avro schema and a corresponding data file in plain-text JSON format.
- We will use Avro Tools to convert the JSON file into binary Avro, without and with compression (Snappy), and from binary Avro back to JSON.
Getting Avro Tools
You can get a copy of the latest stable Avro Tools jar file from the Avro Releases page. The actual file is in the java subdirectory of a given Avro release version. Here is a direct link to avro-tools-1.7.4.jar (11 MB) on the US Apache mirror site.
Save avro-tools-1.7.4.jar to a directory of your choice. I will use ~/avro-tools-1.7.4.jar for the examples shown below.
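If you prefer to download the jar from the command line, something like the following should work as well (the URL below assumes the usual Apache archive layout for the 1.7.4 release; any mirror that carries this release will do):
$ wget -O ~/avro-tools-1.7.4.jar http://archive.apache.org/dist/avro/avro-1.7.4/java/avro-tools-1.7.4.jar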
Tools included in Avro Tools
Just run Avro Tools without any parameters to see what’s included:
$ java -jar ~/avro-tools-1.7.4.jar
[...snip...]
Available tools:
compile Generates Java code for the given schema.
concat Concatenates avro files without re-compressing.
fragtojson Renders a binary-encoded Avro datum as JSON.
fromjson Reads JSON records and writes an Avro data file.
fromtext Imports a text file into an avro data file.
getmeta Prints out the metadata of an Avro data file.
getschema Prints out schema of an Avro data file.
idl Generates a JSON schema from an Avro IDL file
induce Induce schema/protocol from Java class/interface via reflection.
jsontofrag Renders a JSON-encoded Avro datum as binary.
recodec Alters the codec of a data file.
rpcprotocol Output the protocol of a RPC service
rpcreceive Opens an RPC Server and listens for one message.
rpcsend Sends a single RPC message.
tether Run a tethered mapreduce job.
tojson Dumps an Avro data file as JSON, one record per line.
totext Converts an Avro data file to a text file.
trevni_meta Dumps a Trevni file's metadata as JSON.
trevni_random Create a Trevni file filled with random instances of a schema.
trevni_tojson Dumps a Trevni file as JSON.
Likewise, run any particular tool without parameters to see its usage/help output. For example, here is the help of the fromjson tool:
$ java -jar ~/avro-tools-1.7.4.jar fromjson
Expected 1 arg: input_file
Option Description
------ -----------
--codec Compression codec (default: null)
--schema Schema
--schema-file Schema File
Note that most of the tools write to STDOUT, so normally you will want to redirect their output to a file via the Bash > redirection operator (particularly when working with large files).
Avro schema
The schema below defines a tuple of (username, tweet, timestamp) as the format of our example data records.
File: twitter.avsc:
{
"type" : "record",
"name" : "twitter_schema",
"namespace" : "com.miguno.avro",
"fields" : [ {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account on Twitter.com"
}, {
"name" : "tweet",
"type" : "string",
"doc" : "The content of the user's Twitter message"
}, {
"name" : "timestamp",
"type" : "long",
"doc" : "Unix epoch time in seconds"
} ],
"doc:" : "A basic schema for storing Twitter messages"
}
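A quick sanity check before feeding the schema to Avro Tools: a typo in the schema file will only surface as a parse error later, so it can help to confirm that the file is at least well-formed JSON, for example with Python's built-in json.tool module:
$ python -m json.tool twitter.avsc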
Data records in JSON format
And here is some corresponding example data with two records that follow the schema defined in the previous section. We store this data in the file twitter.json.
Example data in twitter.json in JSON format:
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
Converting to and from binary Avro
JSON to binary Avro
Without compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --schema-file twitter.avsc twitter.json > twitter.avro
With Snappy compression:
$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro
Note for Mac OS X users: If you run into SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] when trying to compress the data with Snappy, make sure you use JDK 6 and not JDK 7 (see the Known Issues section below).
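To verify which codec actually ended up in a given Avro file, you can inspect the file's metadata with the getmeta tool from the listing above. The output below is abridged and its exact formatting may vary between Avro versions, but a Snappy-compressed file should report avro.codec as snappy:
$ java -jar ~/avro-tools-1.7.4.jar getmeta twitter.snappy.avro
avro.schema   {"type":"record","name":"twitter_schema", ... }
avro.codec    snappy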
Binary Avro to JSON
The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.avro > twitter.json
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.snappy.avro > twitter.json
Note for Mac OS X users: If you run into SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] when trying to decompress the data with Snappy, make sure you use JDK 6 and not JDK 7.
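If you only want to peek at the first record or two of a large file, remember that tojson writes to STDOUT, so you can simply pipe its output through head instead of dumping everything to disk:
$ java -jar ~/avro-tools-1.7.4.jar tojson twitter.avro | head -n 1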
Retrieve Avro schema from binary Avro
The same command will work on both uncompressed and compressed data.
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.avro > twitter.avsc
$ java -jar ~/avro-tools-1.7.4.jar getschema twitter.snappy.avro > twitter.avsc
Known Issues of Snappy with JDK 7 on Mac OS X
If you happen to use JDK 7 on Mac OS X 10.8, you will run into the error below when trying to run the Snappy-related commands above. In that case make sure to explicitly use JDK 6. On Mac OS X 10.8 the JDK 6 java binary is by default available at /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java.
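For example, to run the Snappy compression step from above explicitly with the JDK 6 binary:
$ /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/bin/java \
    -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro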
The cause of this problem is documented in the bug report Native (Snappy) library loading fails on openjdk7u4 for mac. This bug is already fixed in the latest Snappy-Java 1.5 milestone releases, but Avro 1.7.4 still depends on the latest stable release of Snappy-Java, which is 1.0.4.1 (see lang/java/pom.xml in the Avro source code).
I also found that one way to fix this problem when writing your own Java code is to explicitly require Snappy 1.5.x. Here is the relevant dependency declaration for build.gradle in case you are using Gradle. This seems to solve the problem, but I have yet to confirm whether it is a safe approach for production scenarios.
// Required to fix a Snappy native library error on OS X when trying to compress Avro files with Snappy;
// Avro 1.7.4 uses the latest stable release of Snappy, 1.0.4.1 (see avro/lang/java/pom.xml) that still contains
// the original bug described at https://github.com/xerial/snappy-java/issues/6.
//
// Note that in a production setting we do not care about OS X, so we could use Snappy 1.0.4.1 as required by
// Avro 1.7.4 as is.
//
compile group: 'org.xerial.snappy', name: 'snappy-java', version: '1.0.5-M4'
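If you use this workaround, you can double-check which snappy-java version actually ends up on your compile classpath by looking at Gradle's dependency report, for example:
$ gradle dependencies --configuration compile | grep snappy-java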
Detailed error message:
$ uname -a
Darwin mac.local 12.3.0 Darwin Kernel Version 12.3.0: Sun Jan 6 22:37:10 PST 2013; root:xnu-2050.22.13~1/RELEASE_X86_64 x86_64
$ java -version
java version "1.7.0_17"
Java(TM) SE Runtime Environment (build 1.7.0_17-b02)
Java HotSpot(TM) 64-Bit Server VM (build 23.7-b01, mixed mode)
$ java -jar ~/avro-tools-1.7.4.jar fromjson --codec snappy --schema-file twitter.avsc twitter.json > twitter.snappy.avro
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:317)
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:219)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
at org.apache.avro.tool.Main.run(Main.java:80)
at org.apache.avro.tool.Main.main(Main.java:69)
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1860)
at java.lang.Runtime.loadLibrary0(Runtime.java:845)
at java.lang.System.loadLibrary(System.java:1084)
at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
... 15 more
Exception in thread "main" org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
at org.xerial.snappy.SnappyLoader.load(SnappyLoader.java:229)
at org.xerial.snappy.Snappy.<clinit>(Snappy.java:44)
at org.apache.avro.file.SnappyCodec.compress(SnappyCodec.java:43)
at org.apache.avro.file.DataFileStream$DataBlock.compressUsing(DataFileStream.java:349)
at org.apache.avro.file.DataFileWriter.writeBlock(DataFileWriter.java:348)
at org.apache.avro.file.DataFileWriter.writeIfBlockFull(DataFileWriter.java:295)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:266)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:109)
at org.apache.avro.tool.Main.run(Main.java:80)
at org.apache.avro.tool.Main.main(Main.java:69)
Where to go from here
The example commands above show just a few variants of how to use Avro Tools to read, write and convert Avro files. The Avro Tools library also has API documentation (javadocs), but I found those docs not that helpful (the sources are, however). I'd recommend simply running the command line tools without parameters and reading the usage instructions they print to STDOUT; normally this is enough to understand how they should be used.