pept.utilities.ChunkReader#

class pept.utilities.ChunkReader(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#

Bases: object

Class for fast, on-demand reading / parsing and iteration over chunks of data from CSV files.

This is an abstraction above pandas.read_csv for easy and fast iteration over chunks of data from a CSV file. The chunks can be accessed using normal iteration (for chunk in reader: …) and subscripting (reader[0]).

The chunks are read lazily, only upon access, making this a more efficient alternative to read_csv for large files (> 1,000,000 lines). For convenience, this class configures sensible default parameters for pandas.read_csv for fast reading and parsing of typical PEPT data.

Most importantly, it reads chunks containing chunksize lines from a space-separated values file at filepath_or_buffer, optionally skipping skiprows lines and reading in at most nrows lines. It returns numpy.ndarray instances with float values.

Raises
IndexError

Upon access to a non-existent chunk using subscript notation (e.g. reader[100] when there are only 50 chunks).
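A minimal sketch of handling this, assuming reader is a ChunkReader with only 50 chunks:

>>> try:
...     chunk = reader[100]     # does not exist; raises IndexError
... except IndexError:
...     chunk = None            # handle the out-of-range access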

See also

pept.utilities.read_csv

Fast CSV file reading into numpy arrays.

pept.LineData

Encapsulate LoRs for ease of iteration and plotting.

pept.PointData

Encapsulate points for ease of iteration and plotting.

Examples

Say “data.csv” contains 1_000_000 lines of data. Read chunks of 10_000 lines at a time, skipping the first 100_000:

>>> from pept.utilities import ChunkReader
>>> chunks = ChunkReader("data.csv", 10_000, skiprows = 100_000)
>>> len(chunks)         # 90 chunks
>>> chunks.file_lines   # 1_000_000

Normal iteration:

>>> for chunk in chunks:
...     ...  # neat operations

Access a single chunk using subscripting:

>>> chunks[0]   # First chunk
>>> chunks[-1]  # Last chunk
>>> chunks[100] # IndexError
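Because each chunk is a plain numpy.ndarray, chunks can be combined with standard NumPy functions; a small sketch (the column count depends on the file):

>>> import numpy as np
>>> combined = np.vstack([chunks[0], chunks[1]])    # stack two chunks row-wise
>>> combined.shape                                  # (20_000, number_of_columns)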

Attributes
filepath_or_buffer : str, path object or file-like object

Any valid string path to a local file is acceptable. If you want to read in lines from an online location (i.e. using a URL), use pept.utilities.read_csv instead. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

number_of_chunks : int

The number of chunks (also returned when using the len method), taking into account the lines skipped (skiprows), the number of lines in the file (file_lines) and the maximum number of lines to be read (nrows).

file_lines : int

The number of lines in the file pointed at by filepath_or_buffer.

chunksize : int

The number of lines in a chunk of data.

skiprows : int

The number of lines to be skipped at the beginning of the file.

nrows : int

The maximum number of lines to be read. Only has an effect if it is less than file_lines - skiprows. For example, if a file has 10 lines, skiprows = 5 and chunksize = 5, then even if nrows were 20, number_of_chunks would still be 1.
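As an illustrative check of the example above (plain Python arithmetic, not the class's internal code), the number of chunks follows from the effective number of readable lines:

>>> import math
>>> file_lines, skiprows, chunksize, nrows = 10, 5, 5, 20
>>> effective = min(nrows, file_lines - skiprows)   # only 5 lines remain
>>> math.ceil(effective / chunksize)                # hence a single chunk
1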

__init__(filepath_or_buffer, chunksize, skiprows=None, nrows=None, dtype=<class 'float'>, sep='\s+', header=None, engine='c', na_filter=False, quoting=3, memory_map=True, **kwargs)[source]#

ChunkReader class constructor.

Parameters
filepath_or_buffer : str, path object or file-like object

Any valid string path to a local file is acceptable. If you want to read in lines from an online location (i.e. using a URL), you should use pept.utilities.read_csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.

chunksize : int

Number of lines read in a chunk of data.

skiprows : list-like, int or callable, optional

Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file (see the sketch after this parameter list).

nrows : int, optional

Number of rows of file to read. Useful for reading pieces of large files.

dtype : Type name, default float

Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}.

sep : str, default "\s+"

Delimiter to use. Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine.

header : int, list of int, "infer", optional

Row number(s) to use as the column names, and the start of the data. By default, no header is assumed to be present (i.e. header = None).

engine : {'c', 'python'}, default 'c'

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

na_filter : bool, default False

Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

quoting : int or csv.QUOTE_* instance, default csv.QUOTE_NONE

Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

memory_map : bool, default True

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

**kwargs : optional

Extra keyword arguments that will be passed to pandas.read_csv.
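As a hedged sketch of the more flexible parameters (the file name is hypothetical), skiprows also accepts a callable and any extra keyword arguments are forwarded to pandas.read_csv:

>>> from pept.utilities import ChunkReader
>>> # Skip every other line using a callable (0-indexed, pandas semantics)
>>> chunks = ChunkReader("data.csv", 10_000, skiprows = lambda i: i % 2 == 1)
>>> # Forward a pandas.read_csv option, e.g. only read the first 4 columns
>>> chunks = ChunkReader("data.csv", 10_000, usecols = [0, 1, 2, 3])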

Raises
EOFError

If skiprows >= file_lines.

Methods

__init__(filepath_or_buffer, chunksize[, ...])

ChunkReader class constructor.

Attributes

chunksize

file_lines

nrows

number_of_chunks

skiprows

property number_of_chunks#
property file_lines#
property chunksize#
property skiprows#
property nrows#