Encodes the given data with snappy. If xerial_compatible is set, the stream is encoded in a fashion compatible with the xerial snappy library.
The block size (xerial_blocksize) controls how frequently the blocking occurs; 32k is the default in the xerial library.
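+------------+--------------+------------+
| Block1 len | Block1 data  | Blockn len |
|------------+--------------+------------|
| BE int32   | snappy bytes | BE int32   |
+------------+--------------+------------+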
It is important to note that the blocksize is the amount of uncompressed data presented to snappy at each block, whereas the blocklen is the number of bytes that will be present in the stream; that is, the length will always be <= blocksize.
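The length-prefixed block layout described above can be sketched as follows. This is only an illustration of the framing, not the library's encoder: `frame_blocks` and `unframe_blocks` are hypothetical names, the `compress` callable is an assumption, and `zlib` stands in for snappy so that the example runs with the standard library alone. Any stream header is left out.

```python
import struct
import zlib


def frame_blocks(data, compress, blocksize=32 * 1024):
    """Split data into blocksize chunks and emit <BE int32 length><block> pairs."""
    out = []
    for start in range(0, len(data), blocksize):
        # blocksize bounds the UNCOMPRESSED input handed to the compressor.
        block = compress(data[start:start + blocksize])
        # The length prefix records the number of bytes written to the stream.
        out.append(struct.pack(">i", len(block)))
        out.append(block)
    return b"".join(out)


def unframe_blocks(stream, decompress):
    """Inverse of frame_blocks: read each length prefix, then that many bytes."""
    pos = 0
    chunks = []
    while pos < len(stream):
        (block_len,) = struct.unpack_from(">i", stream, pos)
        pos += 4
        chunks.append(decompress(stream[pos:pos + block_len]))
        pos += block_len
    return b"".join(chunks)


# zlib is a stand-in here; a real xerial stream would use snappy.
payload = b"abc" * 50000
framed = frame_blocks(payload, zlib.compress)
restored = unframe_blocks(framed, zlib.decompress)
```

Passing the compressor as an argument keeps the framing logic independent of the codec, which makes the block boundaries easy to inspect.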
BrokerMetadata(nodeId, host, port)
host: Alias for field number 1
nodeId: Alias for field number 0
port: Alias for field number 2
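The "alias for field number N" wording reflects how named tuples work: every attribute is a named view onto a positional slot. A minimal sketch, assuming the struct is a plain `collections.namedtuple` with the field order given in the signature above (the definition below is re-created for illustration and is not imported from the library):

```python
from collections import namedtuple

# Assumed definition, mirroring the signature BrokerMetadata(nodeId, host, port).
BrokerMetadata = namedtuple("BrokerMetadata", ["nodeId", "host", "port"])

broker = BrokerMetadata(nodeId=0, host="localhost", port=9092)

# Attribute access and index access read the same slot.
same_node = broker.nodeId == broker[0]
same_host = broker.host == broker[1]
same_port = broker.port == broker[2]

# Being a tuple, the struct can also be unpacked positionally.
node_id, host, port = broker
```

The same pattern applies to the other structs listed in this section.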
FetchRequest(topic, partition, offset, max_bytes)
max_bytes: Alias for field number 3
offset: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
FetchResponse(topic, partition, error, highwaterMark, messages)
error: Alias for field number 2
highwaterMark: Alias for field number 3
messages: Alias for field number 4
partition: Alias for field number 1
topic: Alias for field number 0
KafkaMessage(topic, partition, offset, key, value)
key: Alias for field number 3
offset: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
value: Alias for field number 4
Message(magic, attributes, key, value)
attributes: Alias for field number 1
key: Alias for field number 2
magic: Alias for field number 0
value: Alias for field number 3
MetadataResponse(brokers, topics)
brokers: Alias for field number 0
topics: Alias for field number 1
OffsetAndMessage(offset, message)
message: Alias for field number 1
offset: Alias for field number 0
OffsetCommitRequest(topic, partition, offset, metadata)
metadata: Alias for field number 3
offset: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
OffsetCommitResponse(topic, partition, error)
error: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
OffsetFetchRequest(topic, partition)
partition: Alias for field number 1
topic: Alias for field number 0
OffsetFetchResponse(topic, partition, offset, metadata, error)
error: Alias for field number 4
metadata: Alias for field number 3
offset: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
OffsetRequest(topic, partition, time, max_offsets)
max_offsets: Alias for field number 3
partition: Alias for field number 1
time: Alias for field number 2
topic: Alias for field number 0
OffsetResponse(topic, partition, error, offsets)
error: Alias for field number 2
offsets: Alias for field number 3
partition: Alias for field number 1
topic: Alias for field number 0
PartitionMetadata(topic, partition, leader, replicas, isr, error)
error: Alias for field number 5
isr: Alias for field number 4
leader: Alias for field number 2
partition: Alias for field number 1
replicas: Alias for field number 3
topic: Alias for field number 0
ProduceRequest(topic, partition, messages)
messages: Alias for field number 2
partition: Alias for field number 1
topic: Alias for field number 0
ProduceResponse(topic, partition, error, offset)
error: Alias for field number 2
offset: Alias for field number 3
partition: Alias for field number 1
topic: Alias for field number 0
TopicAndPartition(topic, partition)
partition: Alias for field number 1
topic: Alias for field number 0
TopicMetadata(topic, error, partitions)
error: Alias for field number 1
partitions: Alias for field number 2
topic: Alias for field number 0
A socket connection to a single Kafka broker
This class is _not_ thread safe. Each call to send must be followed by a call to recv in order to get the correct response. Eventually, we can do something in here to facilitate multiplexed requests/responses since the Kafka API includes a correlation id.
host: the host name or IP address of a kafka broker
port: the port number the kafka broker is listening on
timeout: default 120. The socket timeout for sending and receiving data in seconds. None means no timeout, so a request can block forever.
Create an inactive copy of the connection object. A reinit() has to be done on the copy before it can be used again. Returns a new KafkaConnection object.
Get a response packet from Kafka
Collects a comma-separated set of hosts (host:port) and optionally randomizes the returned list.
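A rough sketch of this kind of host-list parsing is shown below. The function name `parse_hosts`, the fallback to port 9092 when no port is given, and the `(host, port)` tuple output are all assumptions made for illustration; the library's own helper may differ, and IPv6 literals are not handled here.

```python
import random

# Assumption: fall back to Kafka's conventional port when none is supplied.
DEFAULT_KAFKA_PORT = 9092


def parse_hosts(hosts, randomize=True):
    """Turn 'host1:9092,host2:9093' into a list of (host, port) tuples."""
    result = []
    for entry in hosts.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and stray whitespace
        host, _, port = entry.partition(":")
        result.append((host, int(port) if port else DEFAULT_KAFKA_PORT))
    if randomize:
        # Shuffling spreads initial connections across the listed brokers.
        random.shuffle(result)
    return result


ordered = parse_hosts("a:1, b:2,c", randomize=False)
shuffled = parse_hosts("a:1,b:2")
```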
Context manager to commit/rollback consumer offsets.
Provides commit/rollback semantics around a SimpleConsumer.
Usage assumes that auto_commit is disabled, that messages are consumed in batches, and that the consuming process will record its own successful processing of each message. Both the commit and rollback operations respect a “high-water mark” to ensure that the last unsuccessfully processed message will be retried.
Example:
consumer = SimpleConsumer(client, group, topic, auto_commit=False)
consumer.provide_partition_info()
consumer.fetch_last_known_offsets()
while some_condition:
    with OffsetCommitContext(consumer) as context:
        messages = consumer.get_messages(count, block=False)
        for partition, message in messages:
            if can_process(message):
                context.mark(partition, message.offset)
            else:
                break
    if not context:
        sleep(delay)
These semantics allow for deferred message processing (e.g. if can_process compares message time to clock time) and for repeated processing of the last unsuccessful message (until some external error is resolved).
Commit this context’s offsets:
- If the high-water mark has moved, commit up to and position the consumer at the high-water mark.
- Otherwise, reset the consumer to the initial offsets.
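The commit-or-reset rule above can be illustrated with a small stand-alone tracker. This is a hypothetical simplification, not the library's implementation: `OffsetTracker` and `resolve` are invented names, and treating the mark as "offset of the last processed message plus one" is an assumption about what the consumer should read next.

```python
class OffsetTracker:
    """Hypothetical bookkeeping for commit/rollback around a batch."""

    def __init__(self, initial_offsets):
        # Offsets the consumer held when the context was entered.
        self.initial_offsets = dict(initial_offsets)
        self.high_water_mark = {}

    def mark(self, partition, offset):
        # Assumption: the next offset to read is one past the processed message.
        current = self.high_water_mark.get(partition, 0)
        self.high_water_mark[partition] = max(current, offset + 1)

    def resolve(self):
        """Return (action, offsets) describing where the consumer should sit."""
        if self.high_water_mark:
            offsets = dict(self.initial_offsets)
            offsets.update(self.high_water_mark)
            return ("commit", offsets)
        # Nothing was marked: go back to where the batch started.
        return ("rollback", dict(self.initial_offsets))


tracker = OffsetTracker({0: 10, 1: 20})
untouched = tracker.resolve()
tracker.mark(0, 12)
tracker.mark(0, 11)  # an older offset never lowers the mark
advanced = tracker.resolve()
```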
Handle out of range condition by seeking to the beginning of valid ranges.
This assumes that an out of range doesn’t happen by seeking past the end of valid ranges – which is far less likely.
Class to encapsulate all of the protocol encoding/decoding. This class does not have any state associated with it; it is purely for organization.
Decode bytes to a FetchResponse
Decode bytes to a MetadataResponse
Decode bytes to an OffsetCommitResponse
Decode bytes to an OffsetFetchResponse
Decode bytes to an OffsetResponse
Decode bytes to a ProduceResponse
Encodes some FetchRequest structs
client_id: string
correlation_id: int
payloads: list of FetchRequest
max_wait_time: int, how long to block waiting on min_bytes of data
min_bytes: int, the minimum number of bytes to accumulate before returning the response
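To show where client_id, correlation_id, max_wait_time and min_bytes end up, here is a sketch of packing them with `struct`. The byte layout is an assumption based on the Kafka 0.8 wire format (an int16 api key, int16 api version, int32 correlation id and a length-prefixed client id, followed for fetch requests by int32 replica id, max wait time and min bytes). The function names are hypothetical, and the per-topic payload encoding is omitted.

```python
import struct

FETCH_API_KEY = 1  # assumption: api key for Fetch in the 0.8 protocol


def encode_request_header(client_id, correlation_id, api_key, api_version=0):
    """Pack the assumed common request header as big-endian fields."""
    client_bytes = client_id.encode("utf-8")
    return struct.pack(
        ">hhih%ds" % len(client_bytes),
        api_key,            # int16
        api_version,        # int16
        correlation_id,     # int32, echoed back in the response
        len(client_bytes),  # int16 length prefix for the client id
        client_bytes,
    )


def encode_fetch_preamble(max_wait_time, min_bytes):
    """Pack the assumed leading fields of a fetch request body."""
    # A replica id of -1 marks an ordinary client rather than a broker.
    return struct.pack(">iii", -1, max_wait_time, min_bytes)


header = encode_request_header("kafka", 7, FETCH_API_KEY)
preamble = encode_fetch_preamble(max_wait_time=100, min_bytes=1)
```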
Encode a MetadataRequest
Encode some OffsetCommitRequest structs
Encode some OffsetFetchRequest structs
Encode some ProduceRequest structs
client_id: string
correlation_id: int
payloads: list of ProduceRequest
acks: How “acky” you want the request to be
    0: immediate response
    1: written to disk by the leader
    2+: waits for this many replicas to sync
    -1: waits for all replicas to be in sync
Construct a Gzipped Message containing multiple Messages
The given payloads will be encoded, compressed, and sent as a single atomic message to Kafka.
Construct a Message
Create a message set using the given codec.
If codec is CODEC_NONE, return a list of raw Kafka messages. Otherwise, return a list containing a single codec-encoded message.
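The branching described above might look roughly like the sketch below. `SketchMessage` and `build_message_set` are hypothetical stand-ins, the codec constants are assumed values, and the length-prefixed packing of the inner payloads is a simplification rather than the real message-set encoding.

```python
import gzip
import struct
from collections import namedtuple

# Assumed codec identifiers, for illustration only.
CODEC_NONE = 0x00
CODEC_GZIP = 0x01

# Hypothetical message record: which codec was applied, and the bytes.
SketchMessage = namedtuple("SketchMessage", ["codec", "value"])


def build_message_set(payloads, codec=CODEC_NONE):
    """Return raw messages, or one message wrapping the compressed batch."""
    if codec == CODEC_NONE:
        return [SketchMessage(CODEC_NONE, p) for p in payloads]
    if codec == CODEC_GZIP:
        # Simplified inner format: <BE int32 length><payload> per entry.
        packed = b"".join(struct.pack(">i", len(p)) + p for p in payloads)
        return [SketchMessage(CODEC_GZIP, gzip.compress(packed))]
    raise ValueError("unsupported codec: %r" % (codec,))


raw = build_message_set([b"a", b"bc"])
wrapped = build_message_set([b"a", b"bc"], CODEC_GZIP)
```

Returning a list in both cases lets callers treat compressed and uncompressed sets uniformly.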
Construct a Snappy Message containing multiple Messages
The given payloads will be encoded, compressed, and sent as a single atomic message to Kafka.
A timer that can be restarted, unlike threading.Timer (although this uses threading.Timer)
Arguments:
t: timer interval in milliseconds
fn: a callable to invoke
args: tuple of args to be passed to function
kwargs: keyword arguments to be passed to function
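One way to get a restartable timer out of `threading.Timer` is to create a fresh `Timer` on every start, since a `Timer` object can only run once. The class below is a minimal one-shot sketch under that assumption; `RestartableTimer` is a hypothetical name and its behavior is not claimed to match the library's class.

```python
import threading


class RestartableTimer:
    """One-shot timer that may be started again after firing or stopping."""

    def __init__(self, t, fn, args=(), kwargs=None):
        if t <= 0:
            raise ValueError("interval must be positive")
        self.interval = t / 1000.0  # threading.Timer expects seconds
        self.fn = fn
        self.args = args
        self.kwargs = kwargs or {}
        self._timer = None

    def start(self):
        # Cancel any pending run, then arm a brand-new Timer thread.
        self.stop()
        self._timer = threading.Timer(self.interval, self.fn, self.args, self.kwargs)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None


fired = threading.Event()
timer = RestartableTimer(10, fired.set)
timer.start()
first_run = fired.wait(2.0)
fired.clear()
timer.start()  # restarting works because a new Timer is created
second_run = fired.wait(2.0)
timer.stop()
```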
Base class to be used by other consumers. Not to be used directly
This base class provides logic for
A simpler kafka consumer
# A very basic 'tail' consumer, with no stored offset management
kafka = KafkaConsumer('topic1')
for m in kafka:
    print(m)
# Alternate interface: next()
print(kafka.next())
# Alternate interface: batch iteration
while True:
    for m in kafka.fetch_messages():
        print(m)
    print("Done with batch - let's do another!")
# more advanced consumer -- multiple topics w/ auto commit offset management
kafka = KafkaConsumer('topic1', 'topic2',
                      group_id='my_consumer_group',
                      auto_commit_enable=True,
                      auto_commit_interval_ms=30 * 1000,
                      auto_offset_reset='smallest')
# Infinite iteration
for m in kafka:
    process_message(m)
    kafka.task_done(m)
# Alternate interface: next()
m = kafka.next()
process_message(m)
kafka.task_done(m)
# If auto_commit_enable is False, remember to commit() periodically
kafka.commit()
# Batch process interface
while True:
    for m in kafka.fetch_messages():
        process_message(m)
        kafka.task_done(m)
messages (m) are namedtuples with attributes:
- m.topic: topic name (str)
- m.partition: partition number (int)
- m.offset: message offset on topic-partition log (int)
- m.key: key (bytes - can be None)
- m.value: message (output of deserializer_class - default is raw bytes)
Configuration settings can be passed to constructor, otherwise defaults will be used:
client_id='kafka.consumer.kafka',
group_id=None,
fetch_message_max_bytes=1024*1024,
fetch_min_bytes=1,
fetch_wait_max_ms=100,
refresh_leader_backoff_ms=200,
metadata_broker_list=None,
socket_timeout_ms=30*1000,
auto_offset_reset='largest',
deserializer_class=lambda msg: msg,
auto_commit_enable=False,
auto_commit_interval_ms=60 * 1000,
consumer_timeout_ms=-1
Configuration parameters are described in more detail at http://kafka.apache.org/documentation.html#highlevelconsumerapi
Store consumed message offsets (marked via task_done()) to kafka cluster for this consumer_group.
Note: this functionality requires server version >=0.8.1.1. See this wiki page.
Configuration settings can be passed to constructor, otherwise defaults will be used:
client_id='kafka.consumer.kafka',
group_id=None,
fetch_message_max_bytes=1024*1024,
fetch_min_bytes=1,
fetch_wait_max_ms=100,
refresh_leader_backoff_ms=200,
metadata_broker_list=None,
socket_timeout_ms=30*1000,
auto_offset_reset='largest',
deserializer_class=lambda msg: msg,
auto_commit_enable=False,
auto_commit_interval_ms=60 * 1000,
auto_commit_interval_messages=None,
consumer_timeout_ms=-1
Configuration parameters are described in more detail at http://kafka.apache.org/documentation.html#highlevelconsumerapi
Sends FetchRequests for all topic/partitions set for consumption. Returns a generator that yields KafkaMessage structs after deserializing with the configured deserializer_class.
Refreshes metadata on errors, and resets fetch offset on OffsetOutOfRange, per the configured auto_offset_reset policy
Key configuration parameters:
Request available fetch offsets for a single topic/partition
topic (str)
partition (int)
request_time_ms (int): Used to ask for all messages before a certain time (ms). There are two special values. Specify -1 to receive the latest offset (i.e. the offset of the next coming message) and -2 to receive the earliest available offset. Note that because offsets are pulled in descending order, asking for the earliest offset will always return you a single element.
max_num_offsets (int)
Return a single message from the message iterator. If consumer_timeout_ms is set, will raise ConsumerTimeout if no message is available. Otherwise blocks indefinitely.
Note that this is also the method called internally during iteration:
for m in consumer:
    pass
Set the topic/partitions to consume. Optionally specify offsets to start from.
Accepts types:
str (utf-8): topic name (will consume all available partitions)
tuple: (topic, partition)
dict: { topic: partition }
Optionally, offsets can be specified directly:
tuple: (topic, partition, offset)
dict: { (topic, partition): offset }
Example:
kafka = KafkaConsumer()
# Consume topic1-all; topic2-partition2; topic3-partition0
kafka.set_topic_partitions("topic1", ("topic2", 2), {"topic3": 0})
# Consume topic1-0 starting at offset 123, and topic2-1 at offset 456
# using tuples --
kafka.set_topic_partitions(("topic1", 0, 123), ("topic2", 1, 456))
# using dict --
kafka.set_topic_partitions({ ("topic1", 0): 123, ("topic2", 1): 456 })
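To make the accepted argument shapes concrete, the sketch below normalizes the forms used in the example calls above into a single mapping. It is a hypothetical helper written for illustration, not the consumer's internal logic: `normalize_topic_partitions` is an invented name, a partition of `None` stands for "all partitions", and an offset of `None` stands for "no explicit offset".

```python
def normalize_topic_partitions(*specs):
    """Map each spec to {(topic, partition): offset}; None means unspecified."""
    result = {}
    for spec in specs:
        if isinstance(spec, str):
            # Bare topic name: consume every partition.
            result[(spec, None)] = None
        elif isinstance(spec, tuple):
            if len(spec) == 2:
                topic, partition = spec
                result[(topic, partition)] = None
            elif len(spec) == 3:
                topic, partition, offset = spec
                result[(topic, partition)] = offset
            else:
                raise ValueError("tuple must have 2 or 3 items: %r" % (spec,))
        elif isinstance(spec, dict):
            for key, value in spec.items():
                if isinstance(key, tuple):
                    result[key] = value          # {(topic, partition): offset}
                else:
                    result[(key, value)] = None  # {topic: partition}
        else:
            raise TypeError("unsupported spec: %r" % (spec,))
    return result


plain = normalize_topic_partitions("topic1", ("topic2", 2), {"topic3": 0})
with_offsets = normalize_topic_partitions(("topic1", 0, 123), {("topic2", 1): 456})
```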
OffsetsStruct(fetch, highwater, commit, task_done)
commit: Alias for field number 2
fetch: Alias for field number 0
highwater: Alias for field number 1
task_done: Alias for field number 3
A consumer implementation that consumes partitions for a topic in parallel using multiple processes
auto_commit: default True. Whether or not to auto commit the offsets
auto_commit_every_n: default 100. How many messages to consume before a commit
Auto commit details: If both auto_commit_every_n and auto_commit_every_t are set, they will reset one another when one is triggered. These triggers simply call the commit method on this class. A manual call to commit will also reset these triggers
Fetch the specified number of messages
count: Indicates the maximum number of messages to be fetched
block: If True, the API will block until some messages are fetched.
timeout: If block is True, the function will block for the specified time (in seconds) until count messages are fetched. If None, it will block forever.
Class for managing the state of a consumer during fetch
A simple consumer implementation that consumes all/specified partitions for a topic
partitions: An optional list of partitions to consume the data from
auto_commit: default True. Whether or not to auto commit the offsets
fetch_size_bytes: number of bytes to request in a FetchRequest
Auto commit details: If both auto_commit_every_n and auto_commit_every_t are set, they will reset one another when one is triggered. These triggers simply call the commit method on this class. A manual call to commit will also reset these triggers
Fetch the specified number of messages
count: Indicates the maximum number of messages to be fetched
block: If True, the API will block until some messages are fetched.
timeout: If block is True, the function will block for the specified time (in seconds) until count messages are fetched. If None, it will block forever.
Alter the current offset in the consumer, similar to fseek
offset: how much to modify the offset
whence: where to modify it from
- 0 is relative to the earliest available offset (head)
- 1 is relative to the current offset
- 2 is relative to the latest known offset (tail)
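The three whence modes amount to choosing a base offset and adding the delta. A minimal sketch, assuming the earliest, current and latest offsets of a partition are already known (`resolve_seek` is a hypothetical helper, not part of the consumer API):

```python
def resolve_seek(offset, whence, earliest, current, latest):
    """Compute the absolute offset for one partition, fseek-style."""
    if whence == 0:
        base = earliest  # relative to the head of the log
    elif whence == 1:
        base = current   # relative to the consumer's position
    elif whence == 2:
        base = latest    # relative to the tail of the log
    else:
        raise ValueError("whence must be 0, 1 or 2, got %r" % (whence,))
    return base + offset


from_head = resolve_seek(5, 0, earliest=100, current=150, latest=200)
from_current = resolve_seek(-10, 1, earliest=100, current=150, latest=200)
from_tail = resolve_seek(-1, 2, earliest=100, current=150, latest=200)
```

Negative deltas are typical with whence 2, for example to re-read the last few messages.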
Base class for a partitioner
Takes a string key and num_partitions as arguments and returns a partition to be used for the message.
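A subclass following this contract could look like the sketch below. The class names and the `partition(key, num_partitions)` method signature are assumptions drawn from the description above, and CRC32 is chosen here only because it is deterministic across processes; the library's own partitioners may hash differently.

```python
import zlib


class Partitioner:
    """Hypothetical base class: subclasses choose a partition for a key."""

    def partition(self, key, num_partitions):
        raise NotImplementedError


class Crc32Partitioner(Partitioner):
    """Maps equal keys to the same partition using a stable checksum."""

    def partition(self, key, num_partitions):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        # zlib.crc32 is stable between runs, unlike the built-in hash() on str.
        return zlib.crc32(key.encode("utf-8")) % num_partitions


partitioner = Crc32Partitioner()
first = partitioner.partition("user-42", 8)
second = partitioner.partition("user-42", 8)
```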
Base class to be used by producers
client: The Kafka client instance to use
async: If set to true, the messages are sent asynchronously via another thread (process). We will not wait for a response to these. WARNING!!! The current implementation of the async producer does not guarantee message delivery. Use at your own risk! Or help us improve with a PR!
batch_send: If True, messages are sent in batches
batch_send_every_n: If set, messages are sent in batches of this size
batch_send_every_t: If set, messages are sent after this timeout
Helper method to send produce requests
@param: topic, name of topic for produce request – type str
@param: partition, partition number for produce request – type int
@param: *msg, one or more message payloads – type bytes
@returns: ResponseRequest returned by server
raises on error
Note that msg type must be encoded to bytes by user. Passing a unicode message will not work; for example, you should encode before calling send_messages via something like unicode_message.encode('utf-8')
All messages produced via this method will set the message ‘key’ to Null
A producer which distributes messages to partitions based on the key
batch_send: If True, messages are sent in batches
batch_send_every_n: If set, messages are sent in batches of this size
batch_send_every_t: If set, messages are sent after this timeout
A simple, round-robin producer. Each message goes to exactly one partition
batch_send: If True, messages are sent in batches
batch_send_every_n: If set, messages are sent in batches of this size
batch_send_every_t: If set, messages are sent after this timeout
random_start: If true, randomize the initial partition which the first message block will be published to; otherwise, if false, the first message block will always publish to partition 0 before cycling through each partition
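The effect of random_start on round-robin selection can be sketched with `itertools.cycle`. This is an illustrative helper under the assumption that partitions are numbered from 0; `partition_cycle` is a hypothetical name, not the producer's internal method.

```python
import itertools
import random


def partition_cycle(partitions, random_start=False):
    """Yield partitions round-robin, optionally starting at a random one."""
    ordered = sorted(partitions)
    if not ordered:
        raise ValueError("at least one partition is required")
    # Without random_start the cycle always begins at the lowest partition.
    start = random.randrange(len(ordered)) if random_start else 0
    return itertools.cycle(ordered[start:] + ordered[:start])


fixed = partition_cycle([0, 1, 2])
fixed_order = [next(fixed) for _ in range(5)]

randomized = partition_cycle([0, 1, 2], random_start=True)
first_pick = next(randomized)
second_pick = next(randomized)
```

Randomizing the start avoids many short-lived producers all sending their first message to the same partition.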