pept.utilities.ChunkReader#

class pept.utilities.ChunkReader(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#

Bases: object

Class for fast, on-demand reading / parsing and iteration over chunks of data from CSV files.

This is an abstraction above pandas.read_csv for easy and fast iteration over chunks of data from a CSV file. The chunks can be accessed using normal iteration (for chunk in reader: …) and subscripting (reader[0]).

The chunks are read lazily, only upon access, making this a more efficient alternative to read_csv for large files (> 1,000,000 lines). For convenience, this class configures sensible default parameters for pandas.read_csv for fast reading and parsing of typical PEPT data.

Most importantly, it reads chunks containing chunksize lines from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in at most nrows lines. It returns numpy.ndarray instances with float values.

Raises
IndexError

Upon access to a non-existent chunk using subscript notation (e.g. reader[100] when there are only 50 chunks).
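A minimal sketch of handling this, assuming reader is a ChunkReader with only 50 chunks:

>>> try:
...     chunk = reader[100]     # does not exist; raises IndexError
... except IndexError:
...     chunk = None            # handle the out-of-range access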

See also

pept.utilities.read_csv

Fast CSV file reading into numpy arrays.

pept.LineData

Encapsulate LoRs for ease of iteration and plotting.

pept.PointData

Encapsulate points for ease of iteration and plotting.

Examples

Say “data.csv” contains 1_000_000 lines of data. Read chunks of 10_000 lines at a time, skipping the first 100_000:

>>> from pept.utilities import ChunkReader
>>> chunks = ChunkReader("data.csv", 10_000, skiprows = 100_000)
>>> len(chunks)         # 90 chunks
>>> chunks.file_lines   # 1_000_000

Normal iteration:

>>> for chunk in chunks:
...     ...  # neat operations

Access a single chunk using subscripting:

>>> chunks[0]   # First chunk
>>> chunks[-1]  # Last chunk
>>> chunks[100] # IndexError
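Because each chunk is a plain numpy.ndarray, chunks can be combined with standard NumPy functions; a small sketch (the column count depends on the file):

>>> import numpy as np
>>> combined = np.vstack([chunks[0], chunks[1]])    # stack two chunks row-wise
>>> combined.shape                                  # (20_000, number_of_columns)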

Attributes
filepath_or_buffer : str, path object or file-like object

Any valid string path to a local file is acceptable. If you want to read in lines from an online location (i.e. using a URL), use pept.utilities.read_csv instead. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

number_of_chunks : int

The number of chunks (also returned when using the len method), taking into account the lines skipped (skiprows), the number of lines in the file (file_lines) and the maximum number of lines to be read (nrows).

file_lines : int

The number of lines in the file pointed at by filepath_or_buffer.

chunksize : int

The number of lines in a chunk of data.

skiprows : int

The number of lines to be skipped at the beginning of the file.

nrows : int

The maximum number of lines to be read. Only has an effect if it is less than file_lines - skiprows. For example, if a file has 10 lines, skiprows = 5 and chunksize = 5, then even if nrows were 20, number_of_chunks would still be 1.
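As an illustrative check of the example above (plain Python arithmetic, not the class's internal code), the number of chunks follows from the effective number of readable lines:

>>> import math
>>> file_lines, skiprows, chunksize, nrows = 10, 5, 5, 20
>>> effective = min(nrows, file_lines - skiprows)   # only 5 lines remain
>>> math.ceil(effective / chunksize)                # hence a single chunk
1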

__init__(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#

ChunkReader class constructor.

Parameters
filepath_or_buffer : str, path object or file-like object

Any valid string path to a local file is acceptable. If you want to read in lines from an online location (i.e. using a URL), you should use pept.utilities.read_csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

chunksize : int

Number of lines read in a chunk of data.

skiprows : list-like, int or callable, optional

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file (see the sketch after this parameter list).

nrows : int, optional

Number of rows of file to read. Useful for reading pieces of large files.

dtype : Type name, default float

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}.

sep : str, default "\s+"

Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

header : int, list of int, "infer", optional

Row number(s) to use as the column names, and the start of the data. By default, no header is assumed to be present (i.e. header = None).

engine : {'c', 'python'}, default 'c'

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

na_filter : bool, default False

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

quoting : int or csv.QUOTE_* instance, default csv.QUOTE_NONE

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

memory_map : bool, default True

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

**kwargs : optional

Extra keyword arguments that will be passed to pandas.read_csv.
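As a hedged sketch of the more flexible parameters (the file name is hypothetical), skiprows also accepts a callable and any extra keyword arguments are forwarded to pandas.read_csv:

>>> from pept.utilities import ChunkReader
>>> # Skip every other line using a callable (0-indexed, pandas semantics)
>>> chunks = ChunkReader("data.csv", 10_000, skiprows = lambda i: i % 2 == 1)
>>> # Forward a pandas.read_csv option, e.g. only read the first 4 columns
>>> chunks = ChunkReader("data.csv", 10_000, usecols = [0, 1, 2, 3])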

Raises
EOFError

If skiprows >= file_lines.

Methods

__init__(filepath_or_buffer, chunksize[, ...])

ChunkReader class constructor.

Attributes

chunksize

file_lines

nrows

number_of_chunks

skiprows

property number_of_chunks#
property file_lines#
property chunksize#
property skiprows#
property nrows#